SUMMARY


The research reported here covers a number of different areas. The
methodology  of  density  estimation  has  been  considered  with  particular 
reference  to  the  smoothed  bootstrap.  A  new  method  of  fitting  parsimonious 
additive  models  has  been  devised.  The  topic  of  statistical  integral  equations 
has  been  investigated  in  detail  and  algorithms  for  two  main  cases  of 
particular  interest  have  been  developed  and  investigated.  Applications  to 
image  analysis  have  been  considered.  Contributions  to  the  theory  of 
estimation from indirect information have been made. There has been
careful  consideration  of  the  appropriate  way  to  penalise  an  edge  process 
model  in  an  image  reconstruction.  The  methodology  of  nonparametric 
discriminant  analysis,  with  particular  reference  to  the  CART  approach,  has 
been  the  subject  of  considerable  attention.  The  ICM  method  of  image 
reconstruction  has  been  studied.  A  new  method  of  image  refinement  has 
been  developed. 


KEYWORDS: density estimation; smoothed bootstrap; nonparametric regression;
computer algebra; parsimonious additive models; piecewise linear fitting;
statistical integral equations; stereology; tomography; EM algorithms; missing
data; indirectly observed images; image analysis; information theory; edge
process; nonparametric discriminant analysis; classification and regression trees;
Markov random fields; iterated conditional modes; image refinement

1. Introduction

Most  of  the  work  conducted  under  the  aegis  of  the  project  has  now  been  written  up  in 
the  form  of  papers  submitted  for  publication.  These  are  listed  in  Section  10  below.  In 
this  report,  a  brief  description  of  the  work  done  will  be  given  under  a  number  of 
headings,  and  fuller  details  are  given  in  the  papers,  most  of  which  are  attached  as 
appendices.  Against  each  head  are  given  the  numbers  of  the  relevant  papers  in  the 
publication list. The same numbering system is used for the appendices.

2. Density Estimation [1,13]

One  of  the  aims  of  the  project  was  the  extension  of  the  existing  density 
estimation  methodology  in  various  directions.  Among  these  was  the  use  of  density 
estimation  in  techniques  such  as  the  smoothed  bootstrap.  A  criterion  has  been 
developed  for  deciding  whether  smoothing  is  worth  performing  in  any  particular 
bootstrap  situation.  For  full  details,  see  [1].  One  novel  feature  was  the  use  of 
Computer  Algebra  to  solve  this  statistical  problem,  and  Professor  Silverman  gave  an 
extremely  well  received  presentation  on  this  aspect  to  a  Royal  Statistical  Society 
workshop  on  Computer  Algebra  in  Statistics. 

In 1951, Fix and Hodges wrote a technical report which contained prophetic work
on nonparametric discriminant analysis and density estimation. The report introduced
several important concepts for the first time, and was never published. It is not just of
historical interest, but contains much material of contemporary relevance. A
commentary [13] has been written placing the paper in context and interpreting its
ideas in the light of more modern developments. The commentary has been submitted
for publication together with the paper itself.

3.  Parsimonious  additive  models  [5,10] 

A  very  simple  and  powerful  new  method  for  fitting  nonlinear  regression  models  was 
devised  and  investigated  by  Professor  Silverman  in  collaboration  with  J.H.  Friedman  of 
Stanford.  The  basic  idea  is  to  fit  a  sequence  of  segmented  linear  regressions  on  single 
variables  to  the  data  and  then  to  use  a  suitable  stopping  rule  to  decide  when  to  stop 
elaborating  the  model.  Finally  a  backward  elimination  step  is  used  to  resimplify  up  to 
an appropriate point. The paper [5] on this material was selected by the editors of
Technometrics  to  be  the  special  discussion  paper  at  the  1988  ASA  meetings  and  will 
shortly  appear  with  discussion  and  rejoinder  in  that  journal. 
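
The basic building block described above, a segmented linear regression on a single variable, can be sketched as follows. This is only an illustrative sketch in Python and not the authors' implementation: the helper names (solve, fit_hinge, best_knot), the candidate knot list and the test data are all assumptions, and the stopping rule and backward elimination step are not shown.

```python
def solve(M, v):
    """Solve the small linear system M a = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            m = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= m * aug[c][k]
    a = [0.0] * n
    for r in range(n - 1, -1, -1):
        a[r] = (aug[r][n] - sum(aug[r][k] * a[k] for k in range(r + 1, n))) / aug[r][r]
    return a


def fit_hinge(x, y, knot):
    """Least-squares fit of y ~ a + b*x + c*max(x - knot, 0); returns (coefficients, RSS)."""
    basis = [[1.0, xi, max(xi - knot, 0.0)] for xi in x]
    M = [[sum(b[i] * b[j] for b in basis) for j in range(3)] for i in range(3)]
    v = [sum(b[i] * yi for b, yi in zip(basis, y)) for i in range(3)]
    coefs = solve(M, v)
    rss = sum((yi - sum(c * bi for c, bi in zip(coefs, b))) ** 2
              for b, yi in zip(basis, y))
    return coefs, rss


def best_knot(x, y, candidates):
    """Choose the candidate knot giving the smallest residual sum of squares."""
    return min(candidates, key=lambda k: fit_hinge(x, y, k)[1])


# Hypothetical data with a single change of slope at x = 1.
x = [i / 10 for i in range(21)]
y = [abs(xi - 1.0) for xi in x]
knot = best_knot(x, y, [0.5, 1.0, 1.5])
coefs, rss = fit_hinge(x, y, knot)
```

In a fuller procedure, steps of this kind would be repeated over variables and knots until the stopping rule is met, and the resulting model then simplified by backward elimination.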

4. Solution of Statistical Integral Equations [6,8,11,15]

A  considerable  amount  of  work  has  been  carried  out  on  the  general  topic  of  the 
solution  of  statistical  integral  equations.  The  main  aim  has  been  the  development  of  a 
general  approach  that  can  be  applied  to  any  problem  where  the  model  for  the  observed 
data is obtained by applying a (known) compact linear operator A to the function f of
real interest. There are two cases of main interest: where the data arise as
observations of Af at known points subject to error ("regression dependence"), and
where the data are observations from a non-homogeneous Poisson process with
intensity Af ("density dependence"). We have developed methodology for both of
these cases, as discussed separately below.

Regression dependence

Suppose the data are of the form $Y_i = Af(t_i) + \varepsilon_i$, where $Af(t) = \int A(t,u) f(u)\,du$ and
the $\varepsilon_i$ are uncorrelated errors with mean zero. A natural estimate of f is given by
constrained penalised least squares, where one finds f to minimise
$S(f) = \|Af - Y\|^2 + \alpha \int (f'')^2$, subject to any relevant linear constraints on f, such as
positivity. In all the applications of interest, positivity is a constraint on f and in
some cases f is constrained in addition to integrate to 1. Particular practical problems
of  interest  arose  from  consultation  with  materials  scientists.  Another  practical  problem 
considered  in  detail  was  the  determination  of  the  ventilation/perfusion  distribution  over 
the  human  lung  given  data  on  inert  gas  elimination  (Evans  &  Wagner,  J.  Appl. 
Physiol.  42,  889-898,  1977). 

The approach adopted was to apply quadratic programming to a discretised
version of S(f), as follows: the function f was approximated by a vector $\mathbf{f}$ of values
on a grid; the vector of the values $Af(t_i)$ of Af at each of the points $t_i$ can then be
expressed by a simple quadrature rule as $\mathbf{A}\mathbf{f}$, where $\mathbf{A}$ is a suitable matrix. The
roughness penalty $\int (f'')^2$ is approximated by a quadratic form $\mathbf{f}^T D \mathbf{f}$. One then
minimises $(Y - \mathbf{A}\mathbf{f})^T (Y - \mathbf{A}\mathbf{f}) + \alpha\,\mathbf{f}^T D \mathbf{f}$ subject to $\mathbf{f} \ge 0$ and any other relevant
constraints. The quadratic programming method used was that of Wolfe
(Econometrica 27, 382-398, 1959), which involves similar manipulations to the simplex
algorithm  for  linear  programming.  The  use  of  Wolfe’s  algorithm  has  the  advantage 
that  the  final  simplex  tableau  gives  additional  information  that  is  of  use  as  explained 
below.  The  broad  conclusions  of  the  work  were  as  follows;  some  of  these  are  treated 
in  detail  in  [15].  It  is  intended  that  one  or  two  papers  will  be  written  based  on  this 
work. 

(a) The quadratic programming algorithm terminated in a reasonable number of steps
in all the applications and simulations tried. For any particular data set, the number
of pivots, and hence the time taken to find the solution, is approximately constant
as the smoothing parameter $\alpha$ varies.

(b) The positivity constraint alone is not sufficient to provide a properly regularised
solution; some smoothing (i.e. $\alpha > 0$) is required in addition. This casts a little doubt
on  some  of  the  existing  methodology  in  this  field,  in  which  one  chooses  a  control 
parameter  to  get  a  non-negative  solution  and  then  assumes  implicitly  that  this  solution 
will  be  sufficiently  smoothed. 

(c) By considering the special case (linear nonparametric regression) where the
minimisation of $\|Y - Af(t)\|^2 + \alpha \int (f'')^2$ can be carried out explicitly, it appears that the
discretisation has a negligible effect, except where $\alpha$ is chosen inappropriately small.

(d)  The  final  QP  tableau  makes  it  possible  to  draw  approximate  Bayesian  posterior 
confidence intervals for the curve f with only trivial additional computational effort.
This  procedure  has  been  implemented  and  investigated.  One  particular  point  of 
interest  is  the  frequentist  behaviour  of  the  Bayesian  confidence  intervals,  which  has 
been  considered  (for  example  by  Wahba)  in  the  nonparametric  regression  case.  In  our 
more  general  setting,  it  is  clear  that  the  choice  of  smoothing  parameter  is  crucial  to 
this  frequentist  behaviour.  Wahba  has  suggested  that  a  smoothing  parameter  chosen 
by  generalised  cross-validation  will  give  Bayesian  posterior  intervals  which  are  also 
(pointwise) frequentist confidence intervals. In the case of ill-posed problems, our
work casts doubt on the generalisability of this claim, because in practice varying $\alpha$
between quite wide limits produces curves which are completely different in
appearance but which fit the data almost equally well; the difference between the
solutions lies in a space spanned by singular functions of A with extremely small
singular values. Yet while the goodness-of-fit to the data remains almost unchanged,
the width and frequentist coverage probabilities of the Bayesian intervals change
dramatically. This general behaviour also shows that it is more appropriate to make
use  of  prior  information  rather  than  attempting  to  choose  the  smoothing  parameter 
automatically.  One  extremely  useful  aspect  of  the  posterior  intervals  was  in 
demonstrating  to  a  materials  science  client  the  effect  of  the  ill-posed  nature  of  his 
particular  problem.  It  was  immediately  clear  that  the  detailed  question  he  was  asking 
was  not  resolvable  on  the  basis  of  the  experiment  conducted. 
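
The discretised criterion described in this subsection (squared residuals plus a second-difference roughness penalty, minimised subject to non-negativity) can be illustrated by the following minimal Python sketch. It does not use Wolfe's algorithm and produces no simplex tableau; a simple cyclic coordinate descent is substituted purely for illustration. The function names, the Gaussian kernel, the grid and the value of the smoothing parameter are all hypothetical, and the penalty uses unit grid spacing so that any scaling is absorbed into alpha.

```python
import math


def second_difference_penalty(m):
    """Return D = L^T L, where L takes second differences of a length-m vector."""
    D = [[0.0] * m for _ in range(m)]
    for r in range(m - 2):
        stencil = ((r, 1.0), (r + 1, -2.0), (r + 2, 1.0))
        for i, a in stencil:
            for j, b in stencil:
                D[i][j] += a * b
    return D


def objective(A, Y, alpha, f):
    """Discretised criterion (Y - Af)^T (Y - Af) + alpha * f^T D f."""
    m = len(f)
    D = second_difference_penalty(m)
    resid = [Y[i] - sum(a * fj for a, fj in zip(A[i], f)) for i in range(len(Y))]
    rough = sum(f[i] * D[i][j] * f[j] for i in range(m) for j in range(m))
    return sum(e * e for e in resid) + alpha * rough


def penalised_nonneg_fit(A, Y, alpha, sweeps=200):
    """Minimise the criterion subject to f >= 0 by cyclic coordinate descent."""
    n, m = len(A), len(A[0])
    D = second_difference_penalty(m)
    Q = [[sum(A[i][j] * A[i][k] for i in range(n)) + alpha * D[j][k]
          for k in range(m)] for j in range(m)]
    b = [sum(A[i][j] * Y[i] for i in range(n)) for j in range(m)]
    f = [0.0] * m
    for _ in range(sweeps):
        for j in range(m):
            # Exact minimisation over coordinate j, clipped at zero.
            rest = sum(Q[j][k] * f[k] for k in range(m) if k != j)
            f[j] = max(0.0, (b[j] - rest) / Q[j][j])
    return f


# Hypothetical example: Gaussian smoothing kernel with a midpoint quadrature rule.
m = 10
grid = [(j + 0.5) / m for j in range(m)]
A = [[math.exp(-((t - u) ** 2) / (2 * 0.2 ** 2)) / m for u in grid] for t in grid]
f_true = [math.exp(-((u - 0.5) ** 2) / (2 * 0.15 ** 2)) for u in grid]
Y = [sum(a * fj for a, fj in zip(row, f_true)) for row in A]
alpha = 1e-3
f_hat = penalised_nonneg_fit(A, Y, alpha)
```

Each coordinate update cannot increase the criterion, so the fitted vector always does at least as well as the zero starting vector while remaining non-negative.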

Density  dependence 

Suppose now that the function A is a non-negative function satisfying $\int A(t,u)\,dt = 1$
for all u and that the available data consist of independent observations $Y_i$ drawn from
the probability density Af. We might think of these observations as being "indirect"
observations from the density f of real interest, since in many of the practical
problems of this type, there is an unobservable sample $X_i$ drawn from f itself, and
each $Y_i$ is drawn from the density $A(\cdot, X_i)$. Examples of this situation in practice
include  the  classical  stereology  problem  of  determining  the  particle-size  distributions 
from  data  collected  on  plane  sections  through  a  composite  medium,  and  the  problem  in 
image  processing  of  reconstructing  a  section  through  the  human  body  by  means  of 
positron emission tomography. Vardi, Shepp and Kaufman (J. Amer. Statist. Assoc.
80, 8-37, 1985) describe this latter problem in detail and give an approach based on
the EM algorithm that aims towards a maximum likelihood estimate of f in any
problem of this kind. In the positron emission tomography problem they consider in
detail, the EM algorithm does not actually converge in a reasonable number of steps,
and so they propose stopping after a finite number of steps, thereby obtaining a
smoother estimate of f than would be given by maximum likelihood, but one which
depends  on  the  starting  point  of  the  iterations  and  which  is  not  a  limit  point  of  any 
iterative  procedure. 

We  have  developed  a  general  approach  in  which  a  smoothing  step  is  introduced 
between  each  EM  iteration.  The  smoothing  part  of  each  iteration  involves  very  little 
computational  effort.  A  wide  variety  of  linear  and  non-linear  smoothers  have  been 
tried  and  the  conclusion  is  that  best  results  are  obtained  by  a  simple  local  averaging 
procedure;  furthermore  the  effect  of  quite  a  small  amount  of  smoothing  is  quite 
dramatic.  The  effect  of  the  discrete  grey  level  nature  of  the  images  was  also 
considered  and  it  was  found  that  the  best  results  are  obtained  by  working  in  continuous 
values for the level of the images and only discretising at the display stage. On all our
empirical  evidence,  the  smoothed  EM  procedure  converges  in  a  reasonable  number  of 
iterations,  and  furthermore  the  limit  point  of  the  procedure  does  not  depend  on  the 
starting  configuration.  We  have  demonstrated  heuristically  that  the  smoothed  EM 
approach  corresponds  to  a  classical  EM  algorithm  applied  to  a  penalised  maximum 
likelihood  problem,  where  the  likelihood  is  penalised  by  a  term  depending 
quadratically on the square root of the function of interest. The paper [6] dealing with
the specific application to the classical stereology problem is already in press; a more
general discussion, and several particular points concerning the positron emission
tomography problem, are given in the paper [11].
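
A minimal Python sketch of one possible form of the smoothed EM iteration is given below, for a one-dimensional binned version of the problem. It is an illustration under stated assumptions and not the code used in the project: the matrix A is assumed to be discretised with columns summing to one, the smoother is a three-point local average with weights (1/4, 1/2, 1/4) and reflection at the ends, and the function names, blur kernel and counts are hypothetical.

```python
import math


def em_step(A, counts, f):
    """One EM update for binned counts with Poisson means (A f)_i; columns of A sum to 1."""
    n, m = len(A), len(f)
    fitted = [sum(A[i][j] * f[j] for j in range(m)) for i in range(n)]
    return [f[j] * sum(A[i][j] * counts[i] / fitted[i] for i in range(n))
            for j in range(m)]


def smooth_step(f):
    """Three-point local average; reflecting the end values preserves the total."""
    padded = [f[0]] + list(f) + [f[-1]]
    return [0.25 * padded[j] + 0.5 * padded[j + 1] + 0.25 * padded[j + 2]
            for j in range(len(f))]


def ems(A, counts, iterations=50):
    """Alternate an EM step and a smoothing step, starting from a flat estimate."""
    m = len(A[0])
    f = [sum(counts) / m] * m
    for _ in range(iterations):
        f = smooth_step(em_step(A, counts, f))
    return f


# Hypothetical blur kernel, normalised so that each column sums to one.
m = 12
raw = [[math.exp(-0.5 * (i - j) ** 2) for j in range(m)] for i in range(m)]
col_sums = [sum(raw[i][j] for i in range(m)) for j in range(m)]
A = [[raw[i][j] / col_sums[j] for j in range(m)] for i in range(m)]
counts = [0, 1, 3, 8, 15, 20, 18, 9, 4, 2, 1, 0]
f_hat = ems(A, counts)
```

With this choice of smoother both steps preserve the total of the estimate, so the fitted intensities always sum to the number of observations.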

It  is  also  possible  to  apply  the  smoothed  EM  approach  to  problems  of  the 
regression  dependence  kind.  Some  comparisons  have  been  made  in  [15]  between  this 
approach  and  the  quadratic  programming  method  for  one-dimensional  problems.  The 
general conclusion is that the results obtained are very similar, and the quadratic
programming method has the advantage of providing approximate posterior confidence
intervals at negligible cost. For problems in which f is a function on a higher
dimensional  space,  the  quadratic  programming  approach  could  not  be  applied,  because 
the  computational  cost  of  each  iteration  depends  quadratically  on  the  number  of  pixels 
or  bins,  and  the  number  of  pivots  required  is  bounded  below  by  the  number  of  bins  in 
which  the  solution  is  non-zero. 

5.  Availability  of  information  in  indirect  observation  problems  [9,12] 

In  any  indirect  observation  problem,  it  is  of  interest  to  ask  how  much  information  is 
actually  available  in  a  sample  of  given  size,  as  compared  to  an  experiment  in  which 
"direct"  observations  are  available  from  the  density  function  being  estimated.  We  have 
concentrated  on  the  positron  emission  tomography  problem,  but  the  general 
methodology  is  applicable  to  any  indirect  estimation  problem  where  the  singular  value 
decomposition  of  the  integral  operator  can  be  expressed  explicitly,  and  also,  as 
explained  in  Section  6  of  the  paper,  to  a  wider  class  of  related  problems. 

Given a large sample $\{Y_i\}$ of indirect observations, we consider the size of the
equivalent sample $\{X_i\}$ of observations, whose original exact positions would allow
equally accurate estimation of the image of interest. Both for indirect and for direct
observations, we establish exact minimax rates of convergence of estimation, for all
possible estimators, over suitable smoothness classes of functions. For indirect data
and (in practice unobservable) direct data in a two-dimensional version of the PET
problem, the rates for mean integrated square error are $n^{-p/(p+2)}$ and $(n/\log n)^{-p/(p+1)}$
respectively, for densities in a class corresponding to bounded square-integrable pth
derivatives. We obtain numerical values for equivalent sample sizes for minimax linear
estimators using a slightly modified error criterion.
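
As a rough illustration of what these rates imply, one can equate the two orders of magnitude. The calculation below ignores the constants in the minimax risks, an assumption made only for illustration; it is not the calculation behind the numerical equivalent sample sizes mentioned above. The symbol m, introduced here, denotes the size of a direct sample that would match an indirect sample of size n.

```latex
\[
\left(\frac{m}{\log m}\right)^{-p/(p+1)} \asymp n^{-p/(p+2)}
\quad\Longrightarrow\quad
\frac{m}{\log m} \asymp n^{(p+1)/(p+2)} .
\]
```

Since (p+1)/(p+2) < 1, the equivalent direct sample is of smaller order than n, up to the logarithmic factor.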

One  of  the  technical  tools  used  in  the  paper  is  an  orthogonal  series  approach 
based  on  the  singular  value  decomposition  of  the  integral  operator.  Although  this 
originally  arose  for  theoretical  reasons,  it  is  shown  in  [9]  that  it  yields  estimates  that 
are  in  a  sense  rate-optimal.  Although  this  estimator  can  only  be  constructed  in  the 
special  case  where  the  SVD  is  explicitly  available,  its  calculation  in  this  case  can  be 
carried  out  quickly,  and  so  it  was  of  interest  to  explore  its  practical  behaviour.  In  [12] 
an  investigation  of  this  kind  was  conducted.  The  method  has  the  advantages  of  speed 
and  of  independence  of  any  pixellation;  it  has  the  disadvantages  of  ignoring  the 
positivity  constraint  and  of  making  rapid  changes  in  value  more  difficult  to  achieve 
than  the  EMS  algorithm.  Nevertheless  it  is  clear  that  the  method  is  certainly  useful  as 
a  "quick  and  dirty"  approach  in  those  cases  where  the  SVD  is  tractable. 

6.  Edge  process  models  [2,4] 

One  of  the  ingredients  of  recent  methodology  in  statistical  image  restoration  is  the  idea 
of  introducing  a  system  of  "edges"  between  pixels  in  the  image.  See,  for  example, 
Geman  and  Geman  (IEEE  Trans.  PAMI-6,  721-741,  1984).  If  an  edge  is  present 
between  two  contiguous  pixels  then  they  are  not  considered  as  neighbours  in  the 
restoration  procedure.  The  use  of  such  a  process  is  likely  to  be  of  value  in  restoring 
images  which  consist  of  a  number  of  regions  within  each  of  which  the  value  varies 
smoothly.  In  penalized  maximum  likelihood  estimation  of  the  image,  the  number  and 
configuration of the edges is controlled by a penalty term; in model-based restoration
using  Markov  random  fields  there  is  an  analogous  penalty  term  in  the  energy  function 
of  the  Gibbs  distribution  for  the  edge  process.  We  have  investigated  how  some 
geometrical insights can be used to provide penalties for the various edge
configurations  in  a  way  that  is  roughly  independent  of  the  pixel  discretisation.  The 
penalties  we  obtained  are  consistent  over  pixels  of  different  sizes,  shapes  and 
orientations,  even  if  these  occur  in  the  same  pattern;  pixel  grids  consisting  of  pixels  of 
different  sizes  are  a  key  element  in  the  work  on  positron  emission  tomography 
discussed  in  [11].  The  cases  of  square,  rectangular,  hexagonal  and  irregular  pixels  are 
considered. 

7. Nonparametric Discriminant Analysis

A  great  deal  of  work  has  been  carried  out  on  the  Classification  and  Regression  Tree 
(CART)  approach  to  nonparametric  discriminant  analysis.  This  has  not  yet  been 
written up in the form of papers, but will first appear in detail in P.C. Taylor's (1989)
PhD thesis.

We  have  designed  and  implemented  an  entirely  novel  method  of  displaying  the 
classification  tree  making  use  of  sophisticated  colour  graphics.  This  method  produces 
"block  tree  diagrams"  which  have  great  practical  value  in  explaining  what  the 
procedure  is  doing,  and  methodological  value  in  pointing  out  ways  in  which  the 
current algorithm is working well and badly. Another area of attention has been
alternative splitting criteria with particular reference to the problems raised when there
is a large number of classes to be considered. In addition to the Gini criterion and the
twoing procedure suggested by Breiman et al., we have investigated five new suggested
splitting criteria, some (but not all!) of which appear to have great promise. The next
main  contribution  has  been  in  looking  at  adaptive  "anti  end-cut  factors"  which  work  to 
prevent  the  introduction  of  large  numbers  of  splits  that  remove  very  small  parts  of  the 
data.  Such  factors  need  to  depend  adaptively  on  such  things  as  the  number  of  cases  at 
the  current  node  and  the  number  of  species  represented,  and  these  ideas  have  been 
incorporated  into  the  procedures. 
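
For reference, the two standard splitting criteria mentioned above can be computed as in the following Python sketch. It covers only the Gini and twoing criteria in their usual textbook form; the five new criteria and the anti end-cut factors are not reproduced here. The function names and the example labels are illustrative assumptions.

```python
from collections import Counter


def class_proportions(labels):
    """Proportion of cases in each class at a node."""
    n = len(labels)
    return {k: c / n for k, c in Counter(labels).items()}


def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a node."""
    return 1.0 - sum(p * p for p in class_proportions(labels).values())


def gini_decrease(left, right):
    """Decrease in Gini impurity achieved by splitting a node into left and right."""
    n = len(left) + len(right)
    p_left, p_right = len(left) / n, len(right) / n
    return gini(left + right) - p_left * gini(left) - p_right * gini(right)


def twoing(left, right):
    """Twoing criterion (pL * pR / 4) * (sum_k |p(k|L) - p(k|R)|)^2."""
    n = len(left) + len(right)
    p_left, p_right = len(left) / n, len(right) / n
    in_left, in_right = class_proportions(left), class_proportions(right)
    classes = set(in_left) | set(in_right)
    spread = sum(abs(in_left.get(k, 0.0) - in_right.get(k, 0.0)) for k in classes)
    return p_left * p_right / 4.0 * spread ** 2


# A perfect split of two equally frequent classes (hypothetical labels).
left, right = ["tracked", "tracked"], ["wheeled", "wheeled"]
perfect_gini = gini_decrease(left, right)
perfect_twoing = twoing(left, right)
```

Both criteria are larger for better splits, so a candidate split would be chosen to maximise whichever criterion is in use.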

Further  refinements  have  been  made  to  the  display  program  for  presenting  the 
successive  splits  carried  out  by  CART  on  a  colour  display.  In  particular  ideas  for 
dealing  with  categorical  variables  have  been  included  in  the  package.  The  algorithm 
itself  has  been  enhanced  to  include  a  surrogate  splits  option,  which  allows  the  program 
to  cope  with  missing  values  in  the  predictor  variables.  Surrogate  splits  can  also  be 
used  to  rank  the  importance  of  each  predictor  variable  in  terms  of  their  discriminatory 
usefulness.  Ongoing  activity  is  in  two  main  areas.  The  first  is  aimed  at  reducing  the 
amount  of  pruning  required  to  create  a  classification  tree.  The  second  is  an  attempt  to 
detect hierarchies in the class structure. For example, when discriminating between
different  types  of  vehicle,  we  may  hope  that  tracked  and  wheeled  vehicles  could  be 
distinguished  near  the  root  of  the  tree. 

8.  Image  Refinement  [3,14] 

A  consequence  of  the  use  of  a  statistical  model  for  a  true  scene  is  the  possibility  of 
producing  restored  images  on  a  finer  pixel  grid  than  that  on  which  the  signal  is 
originally  collected.  This  fact,  pointed  out  by  Jennison  in  the  discussion  of  Besag’s 
paper (J. Royal Statist. Soc., 48, 288-289), has formed the basis of a very promising
avenue of research. There are immediate potential applications in LANDSAT imaging
and  other  forms  of  aerial  photography  where,  because  of  the  large  pixel  size,  the 
proportion  of  "mixed"  pixels  can  be  disturbingly  high;  a  proper  subdivision  of  such 
pixels  into  regions  of  more  than  one  type  should  improve  classification  rates 
considerably.  More  generally,  methods  which  do  not  impose  the  unrealistic 
assumption that a scene is uniform within each pixel offer a possibility of more
accurate  restoration  in  all  image  problems. 

Work in this area has been carried out with the assistance of M. Jubb, whose
(1989)  PhD  Thesis  will  contain  a  full  account  of  progress  thus  far.  It  became  apparent 
at  an  early  stage  of  our  work  that  there  would  be  insuperable  computational  problems 
associated  with  pixel  subdivision  beyond  a  2x2  refinement.  However,  the  limiting 
case  in  which  arbitrary  boundaries  are  allowed  within  each  pixel  was  found  to  be  quite 
tractable.  Our  initial  work  was  to  implement  a  method  for  computing  an 
approximation  to  the  solution  of  this  limiting  case  problem  in  which  straight  line  edges 
were  allowed  within  each  pixel.  This  procedure  was  found  to  be  very  effective  in  the 
presence  of  low  levels  of  additive  noise;  details  and  examples  appear  in  [3]. 

Further  work  has  tackled  the  same  problem  in  the  presence  of  greater  noise 
levels.  An  important  technique  for  producing  starting  values  which  can  then  be 
updated iteratively by the edge fitting algorithm is signal aggregation: by adding
together signals from groups of pixels, a signal on a coarser pixel grid but with greater
signal to noise ratio is obtained. A cascade algorithm, in which a series of
restorations is obtained at successively lower levels of signal aggregation, has been
developed. This has been found to produce good restorations for very noisy data
which  existing  methods  fail  to  handle  at  all  well.  Details  of  this  algorithm  are  given 
in  [14]. 
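
The signal aggregation step itself is simple, and a minimal Python sketch is given below. It shows only the aggregation, not the cascade or the edge fitting algorithm; the function name, the block size and the example image are assumptions made for illustration.

```python
def aggregate(image, block=2):
    """Sum the signal over non-overlapping block x block groups of pixels."""
    rows, cols = len(image), len(image[0])
    return [[sum(image[r + dr][c + dc]
                 for dr in range(block) for dc in range(block))
             for c in range(0, cols - block + 1, block)]
            for r in range(0, rows - block + 1, block)]


# Hypothetical 4 x 4 record aggregated to a 2 x 2 grid.
image = [[1, 2, 0, 0],
         [3, 4, 0, 1],
         [5, 5, 2, 2],
         [5, 5, 2, 2]]
coarse = aggregate(image)
```

If the noise is independent with equal variance in each pixel, summing the four pixels of a uniform 2x2 block multiplies the signal by four and the noise standard deviation by two, so the signal to noise ratio doubles at each level of aggregation.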

Our  research  in  this  area  is  still  continuing.  In  particular,  we  are  considering  the 
additional  problems  associated  with  grey-level  data  and  true  images  which  contain 
objects  separated  by  sharp  boundaries  but  also  with  smooth  changes  in  colour  within 
an  object. 

9.  Markov  random  field  algorithms  for  image  restoration  [7] 

The iterated conditional modes (ICM) approach of Besag and the annealing
approach  of  Geman  and  Geman  have  been  investigated.  A  suite  of  programs  and 
algorithms  implementing  these  approaches  to  image  analysis  was  written  in  order  to 
give a basis for experimentation and improvement. A large simulation study was then
carried out on some aspects of these approaches. One particular aspect of interest has
been the investigation of the appropriate choice of interaction parameter(s) in the
Markov random field model as used in the prior for the images. A theoretical
argument demonstrates that an appealing procedure is to weight diagonal neighbours of
each pixel by $2^{-1/2}$ times the amount used to weight direct neighbours. Such a scheme should
produce reconstructions that are largely unaffected by the way in which the pixel
grid is placed on the true underlying image. The broad conclusions of the simulation
study were that worthwhile gains can be achieved using an 'optimal' value of
Besag's parameter $\beta$ rather than the portmanteau value 1.5, and that in the absence of
specific prior knowledge about the corrupted scene a second order neighbourhood
model with down-weighted diagonals should be used, for example the one suggested
by the theoretical arguments referred to above. For full details see [7].
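
A minimal Python sketch of an ICM sweep with a second order neighbourhood is given below. It is not the suite of programs referred to above. It assumes a record with additive Gaussian noise of known standard deviation about known class means, takes the diagonal weight to be 2^{-1/2} times the direct weight, and uses illustrative names and values throughout.

```python
DIRECT = [(-1, 0), (1, 0), (0, -1), (0, 1)]
DIAGONAL = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
# Diagonal neighbours are down-weighted by 2 ** -0.5 relative to direct neighbours.
NEIGHBOURS = [(d, 1.0) for d in DIRECT] + [(d, 2 ** -0.5) for d in DIAGONAL]


def icm(y, means, sigma, beta, sweeps=5):
    """Iterated conditional modes for a discrete-valued image observed with Gaussian noise."""
    rows, cols = len(y), len(y[0])
    n_classes = len(means)
    # Start from the closest-mean classifier applied to each pixel separately.
    x = [[min(range(n_classes), key=lambda k: (y[r][c] - means[k]) ** 2)
          for c in range(cols)] for r in range(rows)]
    for _ in range(sweeps):
        for r in range(rows):
            for c in range(cols):
                def cost(k):
                    # Weighted count of neighbours currently labelled k.
                    agree = sum(w for (dr, dc), w in NEIGHBOURS
                                if 0 <= r + dr < rows and 0 <= c + dc < cols
                                and x[r + dr][c + dc] == k)
                    return (y[r][c] - means[k]) ** 2 / (2 * sigma ** 2) - beta * agree
                x[r][c] = min(range(n_classes), key=cost)
    return x


# Hypothetical record: a uniform scene with one corrupted pixel in the centre.
record = [[0.0] * 5 for _ in range(5)]
record[2][2] = 1.0
restored = icm(record, means=[0.0, 1.0], sigma=0.5, beta=1.5)
```

In this example the isolated pixel is reassigned to the surrounding class, because the weighted agreement with its eight neighbours outweighs the evidence from its own record.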

Summary

The  bootstrap  and  smoothed  bootstrap  are  considered  as  alternative  methods  of 
estimating  properties  of  unknown  distributions  such  as  the  sampling  error  of  parameter 
estimates.  Criteria  are  developed  for  determining  whether  it  is  advantageous  to  use  the 
smoothed  bootstrap  rather  than  the  standard  bootstrap.  Key  steps  in  the  argument  leading 
to  these  criteria  include  the  study  of  the  estimation  of  linear  functionals  of  distributions 
and  the  approximation  of  general  functionals  by  linear  functionals.  Consideration  of  an 
example,  the  estimation  of  the  standard  error  in  the  variance-stabilized  sample  correlation 
coefficient,  elucidates  previously-published  simulation  results  and  also  illustrates  the  use 
of  computer  algebraic  manipulation  as  a  useful  technique  in  asymptotic  statistics.  Finally, 
the  various  approximations  used  are  vindicated  by  a  simulation  study. 

Some key words: Bootstrap; Computer algebra; Density estimation; Kernel; Resampling; Smoothed bootstrap.


1.  Introduction 
1.1. The standard bootstrap

The bootstrap is an appealing nonparametric approach to the assessment of errors and
related quantities in statistical estimation. The method is described and explored in detail
by Efron (1979, 1982). A typical context in which the bootstrap is used is in assessing
the sampling mean squared error $\alpha(F)$ of an estimate $\hat\theta(X_1, \ldots, X_n)$ of a parameter
$\theta(F)$ based on a sample $X_1, \ldots, X_n$ drawn from an unknown distribution F. If F were
known, $\alpha$ might be most easily estimated by repeatedly simulating samples from F. The
standard bootstrap technique is to estimate $\alpha(F)$ by the sampling method, but with the
samples being drawn not from F itself but from the empirical distribution function $F_n$
of the observed data $X_1, \ldots, X_n$. A sample from $F_n$ is generated by successively selecting
uniformly with replacement from $\{X_1, \ldots, X_n\}$ to construct a bootstrap sample
$\{X_1^*, \ldots, X_n^*\}$. For each bootstrap sample, the estimate $\hat\theta(X_1^*, \ldots, X_n^*)$ of the quantity
$\theta(F_n)$ is calculated. Since arbitrarily large numbers of bootstrap samples can be constructed,
$\alpha(F_n)$ can easily be estimated to any reasonable required accuracy from the simulations.
The quantity $\alpha(F_n)$ is then used as an estimate of $\alpha(F)$.
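
The resampling recipe just described can be sketched in Python as follows. The sketch assumes that the estimator is a plug-in estimator, so that applying it to the observed data gives the value of the parameter for the empirical distribution; the function name, the number of resamples and the example data are illustrative choices and not part of the paper.

```python
import random
import statistics


def bootstrap_mse(data, estimator, n_boot=2000, seed=0):
    """Monte Carlo approximation to the bootstrap estimate of mean squared error."""
    rng = random.Random(seed)
    n = len(data)
    theta_n = estimator(data)  # parameter value under the empirical distribution
    total = 0.0
    for _ in range(n_boot):
        # Draw n values uniformly with replacement from the observed data.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        total += (estimator(resample) - theta_n) ** 2
    return total / n_boot


# Hypothetical data; for the sample mean the exact bootstrap value is known.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
mse_hat = bootstrap_mse(data, statistics.mean)
exact = statistics.pvariance(data) / len(data)  # 8.25 / 10 = 0.825
```

For the sample mean the empirical version of the functional can be written down exactly, which makes it a convenient check: the simulated value should be close to 0.825, differing only through Monte Carlo error.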

The  bootstrap  method  thus  consists  of  two  main  elements,  which  are  often  confused. 
There is first the idea of estimating a functional $\alpha(F)$ by its empirical version $\alpha(F_n)$,
and secondly the observation that $\alpha(F_n)$ can in very many contexts be constructed by
repeated resampling from the observed data. The resampling idea is an extremely
important one, but it has, perhaps, been overstressed at the expense of the underlying
estimation step. Once the two steps are conceptually separated it becomes easier to gain
a fuller understanding of how the bootstrap actually works. In particular it becomes clear
that there is nothing special about estimating functionals $\alpha(F)$ that are themselves
sampling properties of parameter estimates; the bootstrap idea can be applied to any
functional $\alpha(F)$ of interest.


1.2. The smoothed bootstrap

Because the empirical distribution $F_n$ is a discrete distribution, samples constructed
from $F_n$ in the bootstrap simulations will have some rather peculiar properties. All the
values taken by the members of the bootstrap samples will be drawn from the original
sample values, and nearly every sample will contain repeated values. The smoothed
bootstrap (Efron, 1979) is a modification to the bootstrap procedure to avoid samples
with these properties. The essential idea of the smoothed bootstrap is to perform the
repeated sampling not from $F_n$ itself, but from a smoothed version $\hat F$ of $F_n$. Two possible
versions of the smoothed bootstrap will be described in more detail below; whatever
method of smoothing is used, the net effect of using the smoothed bootstrap is to estimate
the functional $\alpha(F)$ by $\alpha(\hat F)$.

The  main  aim  of  this  paper  is  to  investigate  some  properties  of  the  smoothed  bootstrap, 
in  order  to  give  some  insight  into  circumstances  when  the  smoothed  bootstrap  will  give 
better  results  than  the  standard  bootstrap.  As  an  important  by-product,  the  value  of 
computer  algebraic  manipulation  as  a  tool  in  asymptotic  statistics  will  be  demonstrated. 

Efron (1982) considered the application of the bootstrap, and various other techniques,
to  the  estimation  of  the  standard  error  of  the  variance-stabilized  transformed  correlation 
coefficient.  He  illustrated  by  direct  simulation  that  in  a  particular  case  a  suitable  smoothed 
bootstrap  gave  better  estimates  of  standard  error  than  the  standard  bootstrap.  We  shall 
discuss  Efron's  example  later  in  the  present  paper  and  demonstrate  how  his  results  can 
be  elucidated  and  extended  by  using  a  suitable  approximation  argument. 

Before going on to discuss the estimation of general functionals $\alpha(F)$, we shall first
consider the estimation of functionals $\alpha$ that are linear in $F$. For such functionals we
shall obtain simple sufficient conditions under which using the smoothed bootstrap can
decrease the mean squared error in the estimation of $\alpha(F)$.

We close this section by giving details of the two kinds of smoothed bootstrap considered
in later discussion. Suppose $X_1, \ldots, X_n$ is a set of $r$-dimensional observations drawn
from some $r$-variate density $f$ and that $V$ is the variance matrix of $f$ or a consistent
estimator of this variance matrix, such as the sample variance matrix of the data. Choose
a kernel function $K$ such that $K$ is a symmetric probability density function of an $r$-variate
distribution with unit variance matrix, for example the standard unit $r$-variate normal
density.

Define the kernel estimate $\hat{f}_h(x)$ of $f(x)$ by

$$ \hat{f}_h(x) = |V|^{-1/2} n^{-1} h^{-r} \sum_{i=1}^{n} K\{h^{-1} V^{-1/2} (x - X_i)\}, \tag{1-1} $$

and the shrunk kernel estimate $\hat{f}_{h,s}(x)$ by

$$ \hat{f}_{h,s}(x) = (1 + h^2)^{r/2} \hat{f}_h\{(1 + h^2)^{1/2} x\}. \tag{1-2} $$

Density estimates in general are discussed, for example, by Silverman (1986). The
smoothing parameter $h$ determines the amount by which the data are smoothed to provide
estimates. Estimates of the form (1-2) have the property that the density $\hat{f}_{h,s}$ has the same
variance structure as the original data, if $V$ is taken to be the sample variance matrix.
Given any functional $\alpha(F)$ of an $r$-variate distribution $F$, the unshrunk smoothed
bootstrap estimate of $\alpha(F)$ is defined to be $\alpha(\hat{F}_h)$ and the shrunk smoothed bootstrap
estimate is $\alpha(\hat{F}_{h,s})$, where $\hat{F}_h$ and $\hat{F}_{h,s}$ are the distribution functions corresponding to
$\hat{f}_h$ and $\hat{f}_{h,s}$ respectively. It is easy to simulate either from $\hat{f}_h$ or from $\hat{f}_{h,s}$ by sampling with
replacement from the original data and perturbing each sampled point appropriately;
for details see Efron (1979) or Silverman (1986, § 6.4). Hence values of $\alpha(\hat{F}_h)$ and $\alpha(\hat{F}_{h,s})$
can be obtained in practice by simulation if necessary.
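The resample-and-perturb recipe just described can be sketched in a few lines. The block below is only an illustration for the univariate case (r = 1) with a standard normal kernel and V taken to be the sample variance; the function name `smoothed_bootstrap_sample` and its arguments are hypothetical and are not taken from Efron (1979) or Silverman (1986). Shrinkage is applied about the origin, following the form of (1-2) literally.

```python
import math
import random
import statistics


def smoothed_bootstrap_sample(data, h, shrink=False, rng=None):
    """Draw one smoothed bootstrap sample of the same size as `data`.

    Univariate sketch (assumption): normal kernel K, V = sample variance.
    shrink=False samples from the kernel estimate (1-1);
    shrink=True rescales by (1 + h^2)^(-1/2), i.e. samples from (1-2).
    """
    rng = random.Random() if rng is None else rng
    sd = math.sqrt(statistics.variance(data))      # V^(1/2)
    scale = 1.0 / math.sqrt(1.0 + h * h) if shrink else 1.0
    sample = []
    for _ in range(len(data)):
        x = rng.choice(data)                       # resample from F_n
        eps = rng.gauss(0.0, 1.0)                  # perturbation drawn from K
        sample.append(scale * (x + h * sd * eps))
    return sample
```

With h = 0 the function reduces to the standard bootstrap, since every drawn value is then one of the original observations.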


2.  Linear  functionals 

In this section we consider the estimation of a linear functional $A(F)$. Because $A$ is
linear, standard calculus demonstrates the existence of a function $a(t)$ such that

$$ A(F) = \int a(t) \, dF(t). $$

The standard bootstrap estimate $A_0(F)$ will satisfy

$$ A_0(F) = A(F_n) = \int a(t) \, dF_n(t) = n^{-1} \sum_{i=1}^{n} a(X_i). $$

The unshrunk smoothed bootstrap estimate $A_h(F)$ will satisfy

$$ A_h(F) = \int a(t) \hat{f}_h(t) \, dt, $$

and the shrunk smoothed bootstrap estimate $A_{h,s}(F)$ will satisfy

$$ A_{h,s}(F) = \int a(t) \hat{f}_{h,s}(t) \, dt, $$

with $\hat{f}_h$ and $\hat{f}_{h,s}$ as defined in (1-1) and (1-2) above.

In the discussion that follows we assume that the function $a$ has continuous derivatives
of all orders required. All unspecified integrals are taken over the whole of $r$-dimensional
space. Assume that $V$ is fixed and define the differential operator $D_V$ by

$$ D_V a = \sum_{i=1}^{r} \sum_{j=1}^{r} V_{ij} \, \partial^2 a / \partial x_i \partial x_j. $$

Our first theorem gives a criterion for smoothing, without shrinkage, to be of potential
value in the bootstrap estimation process.

Theorem 1. Suppose $a(X)$ and $D_V a(X)$ are negatively correlated. Then the mean
squared error of $A_h(F)$ can be reduced below that of $A_0(F)$ by choosing a suitable $h > 0$.

Proof. Assume without loss of generality that $A(F) = 0$, by replacing $a(t)$ by
$a(t) - \int a(x) f(x) \, dx$ if necessary. By this assumption,

$$ \mathrm{MSE}\{A_h(F)\} = E\{A_h(F)^2\} = \mathrm{var}\{A_h(F)\} + [E\{A_h(F)\}]^2. \tag{2-1} $$

Now, by some easy manipulations, $A_h(F) = n^{-1} \sum w(X_i)$, say, where the sum is over
$i = 1, \ldots, n$, and where

$$ w(x) = \int a(t) h^{-r} |V|^{-1/2} K\{h^{-1} V^{-1/2} (t - x)\} \, dt = \int K(\xi) a(x + h V^{1/2} \xi) \, d\xi \tag{2-2} $$

on making the substitution $t = x + h V^{1/2} \xi$. A Taylor expansion gives

$$ a(x + h V^{1/2} \xi) = a(x) + h (V^{1/2} \xi)^T \nabla a(x) + \tfrac{1}{2} h^2 (V^{1/2} \xi)^T H_a(x) (V^{1/2} \xi) + O(h^3), $$

where $H_a(x)_{ij} = \partial^2 a(x) / \partial x_i \partial x_j$. By our assumptions on the kernel $K$ (its symmetry removes
the terms of odd order in $\xi$, and its unit variance matrix reduces the quadratic term to $D_V a$) it follows that

$$ w(x) = a(x) + \tfrac{1}{2} h^2 D_V a(x) + O(h^4), \tag{2-3} $$

$$ E\{A_h(F)\} = E\{w(X)\} = \tfrac{1}{2} h^2 \int f(x) D_V a(x) \, dx + O(h^4), \tag{2-4} $$

since $\int a(x) f(x) \, dx = 0$. Also, since $X_1, \ldots, X_n$ are independent,

$$ n \, \mathrm{var}\{A_h(F)\} = \mathrm{var}\{w(X)\} = \int a(x)^2 f(x) \, dx + h^2 \int a(x) D_V a(x) f(x) \, dx + O(h^4) \tag{2-5} $$

using (2-3). Combining (2-4) and (2-5) gives the mean squared error

$$ \mathrm{MSE}\{A_h(F)\} = n^{-1} \int a(x)^2 f(x) \, dx + n^{-1} h^2 \int a(x) D_V a(x) f(x) \, dx + O(h^4). \tag{2-6} $$

For fixed $n$, the equation (2-6) demonstrates that, under the assumption that $a(X)$ and
$D_V a(X)$ are negatively correlated, the mean squared error of $A_h(F)$ will, at least for
small $h$, be smaller than that of $A_0(F)$, completing the proof of the theorem. □

The next theorem gives the corresponding criterion for smoothing with shrinkage to
lead to more accurate bootstrap estimation. Define $a^*(X)$ by

$$ a^*(X) = D_V a(X) - X \cdot \nabla a(X). $$

Theorem 2. Suppose $a(X)$ and $a^*(X)$ are negatively correlated. Then the mean squared
error of $A_{h,s}(F)$ can be reduced below that of $A_{0,s}(F) = A_0(F)$ by choosing a suitable $h > 0$.

Proof. As before assume without loss of generality that $A(F) = 0$. We have, by similar
manipulations to those used above, $A_{h,s}(F) = n^{-1} \sum w^*(X_i)$, where

$$ w^*(x) = (1 + h^2)^{r/2} \int a(t) h^{-r} |V|^{-1/2} K[h^{-1} V^{-1/2} \{x - (1 + h^2)^{1/2} t\}] \, dt $$
$$ \phantom{w^*(x)} = \int a\{(1 + h^2)^{-1/2} (x + h V^{1/2} \xi)\} K(\xi) \, d\xi $$

on making the substitution $t = (x + h V^{1/2} \xi) / (1 + h^2)^{1/2}$. Now, for $h$ small, $(1 + h^2)^{-1/2} \simeq 1 - \tfrac{1}{2} h^2$,
so

$$ w^*(x) \simeq \int a(x + h V^{1/2} \xi - \tfrac{1}{2} h^2 x) K(\xi) \, d\xi. $$

A Taylor expansion of $a$ about $x$, and our assumptions on the kernel $K$, give

$$ w^*(x) = a(x) + \tfrac{1}{2} h^2 a^*(x) + O(h^4). \tag{2-7} $$

Now we have

$$ E\{A_{h,s}(F)\} = E\{w^*(X)\} = \tfrac{1}{2} h^2 \int f(x) a^*(x) \, dx + O(h^4), $$

and, on using (2-7),

$$ n \, \mathrm{var}\{A_{h,s}(F)\} = \int a(x)^2 f(x) \, dx + h^2 \int a(x) a^*(x) f(x) \, dx + O(h^4). $$

The proof of Theorem 2 is completed in the same way as that of Theorem 1. □

As a simple illustration, consider the estimation of the sixth moment $\int x^6 f(x) \, dx$ of a
univariate density. It is not immediately clear whether smoothing is worthwhile in this
case. In the notation used above, $a(x) = x^6$, $D_V a(x) = 30 V x^4$ and $a^*(x) = 30 V x^4 - 6 x^6$.
It follows that, setting $\mu_j = E(X^j)$,

$$ \mathrm{cov}\{a(X), a^*(X)\} = -6 \mu_{12} + 30 V \mu_{10} + 6 \mu_6^2 - 30 V \mu_4 \mu_6. $$

If, for example, $X$ has a normal distribution with mean zero and variance $V$, we have
$\mu_{2j} = V^j 2^{-j} (2j)! / j!$ and hence $\mathrm{cov}\{a(X), a^*(X)\} = -34020 V^6 < 0$.

Thus a shrunk smoothed estimate $\int x^6 \hat{f}_{h,s}(x) \, dx$ will always, for a suitably chosen value
of $h$, give a more accurate estimate of $E(X^6)$ than will the raw sixth moment if $X$ is
drawn from a normal distribution. Similar calculations for other distributions show that
the same conclusion holds under a wide variety of distributional assumptions for $X$.
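The arithmetic behind the value -34020 V^6 can be checked directly. The short sketch below evaluates the covariance of a(X) = X^6 and a*(X) = 30 V X^4 - 6 X^6 from the even moments of a normal distribution with mean zero and variance V; the helper names are hypothetical and exist only for this check.

```python
from math import factorial


def normal_moment(k, V=1):
    """E(X^k) for X ~ N(0, V): zero for odd k, V^j (2j)! / (2^j j!) for k = 2j."""
    j, odd = divmod(k, 2)
    if odd:
        return 0
    return V ** j * (factorial(2 * j) // (2 ** j * factorial(j)))


def shrunk_criterion_sixth_moment(V=1):
    """cov{a(X), a*(X)} for a(x) = x^6 and a*(x) = 30 V x^4 - 6 x^6."""
    def mu(k):
        return normal_moment(k, V)
    return -6 * mu(12) + 30 * V * mu(10) + 6 * mu(6) ** 2 - 30 * V * mu(4) * mu(6)


print(shrunk_criterion_sixth_moment(1))   # -34020
```

The criterion scales as V^6, so its sign does not depend on the variance of the underlying normal distribution.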

The results obtained by applying the criteria can sometimes be a little surprising.
Suppose $X$ is drawn from a standard normal distribution. Application of the criterion
for estimation by unshrunk smoothing demonstrates that, for small $h$, this will have a
deleterious effect in the estimation of either $E(X^4)$ or $E(X^2)$ alone. However, for the
linear combination of moments $E(X^4 - cX^2)$, unshrunk smoothing will be worth performing
provided $c > 6$. Details of this somewhat counter-intuitive result are left to the reader
to reconstruct.
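One way to reconstruct the calculation is to evaluate the Theorem 1 criterion, cov{a(X), D_V a(X)} with V = 1, for a polynomial a under the standard normal distribution. The sketch below represents a polynomial as a dictionary mapping powers to coefficients; this representation and all function names are assumptions made for illustration. For a(x) = x^4 - c x^2 the criterion works out to 12(12 - 2c), which is negative exactly when c > 6.

```python
from math import factorial


def normal_moment(k):
    """E(X^k) for X ~ N(0, 1)."""
    j, odd = divmod(k, 2)
    return 0 if odd else factorial(2 * j) // (2 ** j * factorial(j))


def expect(poly):
    """E{p(X)} for a polynomial given as {power: coefficient}."""
    return sum(c * normal_moment(k) for k, c in poly.items())


def second_derivative(poly):
    """D_V p with V = 1 in one dimension, i.e. p''."""
    return {k - 2: c * k * (k - 1) for k, c in poly.items() if k >= 2}


def multiply(p, q):
    """Product of two polynomials in the dictionary representation."""
    out = {}
    for i, a in p.items():
        for j, b in q.items():
            out[i + j] = out.get(i + j, 0) + a * b
    return out


def unshrunk_criterion(poly):
    """cov{a(X), a''(X)}; a negative value means some smoothing helps."""
    d2 = second_derivative(poly)
    return expect(multiply(poly, d2)) - expect(poly) * expect(d2)
```

For example, `unshrunk_criterion({4: 1})` is positive, so smoothing harms the estimation of E(X^4) to this order, while `unshrunk_criterion({4: 1, 2: -7})` is negative. For E(X^2) alone the covariance is zero, and the deleterious effect comes from the squared bias term in (2-1).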

We do not, in this paper, devote much attention to the question of how much smoothing
should be applied in cases where smoothing is worth performing; that problem is left
for future work. However, the last example of this section demonstrates that the question
of how much to smooth can be a rather delicate one. In this example, let $\phi_\sigma$ denote the
density of the normal distribution with mean zero and variance $\sigma^2$. Let

$$ A_\varepsilon(F) = \int \phi_\varepsilon(t) \, dF(t), $$

and suppose that the quantity $\varepsilon$ converges to zero as the sample size increases. Assume
that $F$ has a smooth density $f$ with derivatives of all orders required. Consider the
estimation of $A_\varepsilon(F)$ by the unshrunk smoothed estimator $A_h(F)$, constructed using the
normal density as the kernel. We shall investigate the optimal large-sample behaviour of
the smoothing parameter $h$. Assume throughout that $h$ is small for large $n$ and that $f(0) > 0$.

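Setting $\delta^2 = h^2 + \varepsilon^2$ and performing some simple manipulations, we have

$$ A_h(F) = \int \phi_\varepsilon(t) \hat{f}_h(t) \, dt = n^{-1} \sum_{i=1}^{n} \phi_\delta(X_i). $$

Hence, substituting $t = u\delta$ and performing a Taylor series expansion,

$$ E\{A_h(F)\} = \int \phi_\delta(t) f(t) \, dt = \int \phi_1(u) f(u\delta) \, du = f(0) + \tfrac{1}{2} \delta^2 f''(0) + O(\delta^4). $$

Since, by a similar argument,

$$ A_\varepsilon(F) = \int \phi_\varepsilon(t) f(t) \, dt = f(0) + \tfrac{1}{2} \varepsilon^2 f''(0) + O(\varepsilon^4), $$

it follows that

$$ E\{A_h(F)\} - A_\varepsilon(F) = \tfrac{1}{2} h^2 f''(0) + O(\delta^4). $$

By standard arguments

$$ \mathrm{var}\{A_h(F)\} = n^{-1} \mathrm{var}\{\phi_\delta(X)\} = n^{-1} f(0) / (2 \delta \sqrt{\pi}) \, \{1 + O(\delta)\}. $$

Thus the mean squared error of $A_h(F)$ will be, asymptotically, given by

$$ \mathrm{MSE}\{A_h(F)\} \simeq n^{-1} f(0) / (2 \delta \sqrt{\pi}) + \tfrac{1}{4} h^4 f''(0)^2 = n^{-1} f(0) / (2 \delta \sqrt{\pi}) + \tfrac{1}{4} (\delta^2 - \varepsilon^2)^2 f''(0)^2, $$

where the terms neglected are of order $n^{-1} + \delta^6$. This approximate mean squared error
is a convex function of $\delta$, and its minimizer will satisfy $\delta^3 (\delta^2 - \varepsilon^2) = C(f) n^{-1}$, where
$C(f) = f(0) / \{2 \sqrt{\pi} f''(0)^2\}$, or, in terms of $h$ and $\varepsilon$,

$$ (1 + h^2 / \varepsilon^2)^{3/2} h^2 / \varepsilon^2 = C(f) n^{-1} \varepsilon^{-5}. \tag{2-8} $$

Denote by $\psi(R)$ the root in $[0, \infty)$ of the equation

$$ (1 + \psi^2)^{3/2} \psi^2 = R; $$

then by simple calculus $\psi(R) \sim R^{1/2}$ as $R \to 0$, and $\psi(R) \sim R^{1/5}$ as $R \to \infty$. The asymptotically
optimal $h$ for the estimation of $A_\varepsilon$ will satisfy, from (2-8),

$$ h_{\mathrm{opt}} = \varepsilon \, \psi\{C(f) n^{-1} \varepsilon^{-5}\}. $$

If $n^{-1} \varepsilon^{-5} \to \infty$ then $h_{\mathrm{opt}} \sim \varepsilon \, C(f)^{1/5} n^{-1/5} \varepsilon^{-1} = C(f)^{1/5} n^{-1/5}$.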

Standard density estimation theory (Parzen, 1962) shows that this is the asymptotically
optimal smoothing parameter for the estimation of the density at zero. Thus, in this case,
the best estimate of $A_\varepsilon$ will be based on the best estimate of the density.

Unfortunately this will by no means always be the case. If $n^{-1} \varepsilon^{-5} \to 0$, we will have,
from the behaviour of $\psi(R)$ for small $R$,

$$ h_{\mathrm{opt}} \sim C(f)^{1/2} n^{-1/2} \varepsilon^{-3/2}, $$

and if $n^{-1} \varepsilon^{-5} \to a$, where $0 < a < \infty$, $h_{\mathrm{opt}} \sim \varepsilon \, \psi\{a C(f)\}$.

In neither of these cases will it be optimal to construct an optimal estimate of $f$ in
order to estimate $A_\varepsilon(F)$, since the optimal choice of $h$ will be smaller, in order of
magnitude in the first case, than that required for the estimation of $f$ itself. Thus the
optimal estimate of $A_\varepsilon(F)$ will be based on an undersmoothed estimate of the underlying
density. This example is, of course, rather artificial, but it does illustrate the likely difficulty
of obtaining general rules for deciding how much to smooth when estimating functionals
of a density. Even in cases where smoothing is advantageous, the amount of smoothing
required may be quite different from that needed for the estimation of the density itself.
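The function psi defined through (2-8) has no closed form, but it is easy to evaluate numerically. The following sketch finds the root by bisection and then forms the asymptotically optimal h; the function names, the tolerance and the choice of bisection are assumptions made purely for illustration, and C stands for the constant C(f), which would have to be supplied or estimated.

```python
def psi(R, tol=1e-12):
    """Root in [0, inf) of (1 + p^2)^(3/2) p^2 = R, found by bisection."""
    def g(p):
        return (1.0 + p * p) ** 1.5 * p * p - R
    lo, hi = 0.0, 1.0
    while g(hi) < 0.0:          # expand the bracket until it contains the root
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def h_opt(n, eps, C):
    """Asymptotically optimal h = eps * psi(C / (n * eps^5)), from (2-8)."""
    return eps * psi(C / (n * eps ** 5))
```

The two limiting regimes can be seen numerically: `psi(R)` is close to `R ** 0.5` for very small R and close to `R ** 0.2` for very large R, so that `h_opt(n, eps, C)` approaches `C ** 0.2 * n ** -0.2` when n * eps^5 is small.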


3.  More  general  functionals 
3-1.  Linear  approximation

In this section, the work of § 2 is extended, by considering local linear approximations,
to more general functionals of an unknown distribution. When an explicit bootstrap
method is being used the functional being estimated is unlikely to be linear, and so a
more general theory is necessary. Local linear approximations to functionals of distributions
have also been used by Hinkley & Wei (1984) and Withers (1983).
Consider the estimation of a functional $\alpha(F_0)$ of an unknown distribution $F_0$ underlying
a set of sample data. Suppose that $\alpha$ admits a linear von Mises expansion about $F_0$ given
by

$$ \alpha(F) \simeq \alpha(F_0) + A(F - F_0), \tag{3-1} $$

where the linear functional $A$ is representable as an integral

$$ A(F - F_0) = \int a(t) \, d(F - F_0)(t). \tag{3-2} $$

A detailed discussion of differentiation of functionals and general von Mises approximation
is given by Fernholz (1983). The precise accuracy of the expansion (3-1) depends
on the detailed properties of $\alpha$, but the error will in general be of order $\sup|F - F_0|^2$.

The expansion (3-1) gives an obvious approximation to the bootstrap estimate of $\alpha(F_0)$.
If $\hat{F}$ is an estimate of $F_0$, then we will in general have, provided $\sup|\hat{F} - F_0|$ is $O_p(n^{-1/2})$,

$$ \alpha(\hat{F}) = \alpha(F_0) + A(\hat{F}) - A(F_0) + O_p(n^{-1}), $$

and so the sampling properties of $\alpha(\hat{F})$ will be approximately the same as those of $A(\hat{F})$.
The criteria of § 2 can then be applied to the linear functional $A$. If using an appropriate
smoothed bootstrap will improve the estimation of $A(F_0)$ then, neglecting any errors
inherent in the linear approximation (3-1), the smoothed bootstrap will be worth using
in the estimation of $\alpha(F_0)$.

3-2.  The  transformed  sample  correlation  coefficient
In this section we consider application of the linear approximation procedure to
estimation of the sampling standard deviation of the variance-stabilized sample correlation
coefficient. Suppose $F_0$ is a bivariate distribution with mean zero and correlation coefficient
$\rho$, and let $\zeta = \tanh^{-1} \rho$. Let $r$ be the computed sample correlation coefficient based on a
sample of $n$ independent observations from $F_0$, and let $z = \tanh^{-1} r$ be the sample estimate
of $\zeta$. Then the functional of interest is $\alpha_n(F_0) = \{\mathrm{var}(z)\}^{1/2}$. Efron (1982) devoted considerable
attention to the estimation of $\alpha_n(F_0)$ by a variety of methods, including the smoothed
bootstrap, for the specific case of $F_0$ bivariate normal, with marginals of unit variance
and $\rho = \tfrac{1}{2}$, and for sample size $n = 14$.

A key step in our investigation of the estimation of $\alpha_n(F_0)$ will be an approximate
formula, given by Kendall & Stuart (1977, p. 251). Let
where $\mu_{ij}$ is the $(i, j)$th moment given by $\mu_{ij} = \int x_1^i x_2^j \, dF_0(x)$. Here and subsequently in
this section unsubscripted letters $x$ will denote vectors $(x_1, x_2)$. Kendall & Stuart give

$$ \alpha_n(F_0) = n^{-1/2} \alpha(F_0) + O(n^{-3/2}), $$

so that estimation of $\alpha_n(F_0)$ is approximately equivalent to that of $\alpha(F_0)$.

Consider now the calculation of the function $a(t)$ defined in (3-2). For fixed $t$ let $\delta_t$
be the distribution function of a point mass at $t$ and, for any $\varepsilon > 0$ let $F_\varepsilon$ be the improper
distribution $F_0 + \varepsilon \delta_t$. Then simple calculus combining (3-1) and (3-2) gives

$$ a(t) = [(d / d\varepsilon) \, \alpha(F_\varepsilon)]_{\varepsilon = 0}. \tag{3-4} $$
Our functional $\alpha(F)$ is defined for improper distributions, as well as for probability
distribution functions, and hence there is no need when calculating $a(t)$ to consider the
more complicated perturbation $\varepsilon(\delta_t - F_0)$ to $F_0$ used by Hinkley & Wei (1984). The actual
algebraic manipulations required in the calculation of $a(t)$ from (3-4) and (3-3) are
extremely laborious. However, it is relatively easy to write a program in a computer
algebraic symbolic manipulation language, such as MACSYMA, to perform the necessary
differentiations and substitutions. The function $a(t)$ itself is a fourth-order polynomial
in $t_1$ and $t_2$ whose coefficients depend on the moments of $F_0$. It is only used as an
intermediate step, in the special cases considered below, in the calculation of the criteria
derived from Theorems 1 and 2, and the calculation of these criteria was also performed
by computer algebra. Further details of the manipulations are available from the authors.
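The derivative in (3-4) can also be approximated numerically, which gives a simple way to check a symbolic calculation. The sketch below is not the authors' computer algebra program and does not use the functional (3-3); as a stand-in it takes the correlation functional rho(F) = mu_11 / (mu_20 mu_02)^(1/2) of a mean-zero distribution, which is likewise defined through moments and so extends to improper distributions. A distribution is represented as a list of (point, mass) pairs, and the derivative is taken by a central difference; all names are hypothetical.

```python
import math


def moment(dist, i, j):
    """mu_ij = integral of x1^i x2^j dF for a list of ((x1, x2), mass) pairs.

    The total mass need not equal one, so improper distributions are allowed.
    """
    return sum(w * x1 ** i * x2 ** j for (x1, x2), w in dist)


def rho(dist):
    """Correlation functional of a mean-zero distribution, written via moments."""
    return moment(dist, 1, 1) / math.sqrt(moment(dist, 2, 0) * moment(dist, 0, 2))


def influence(functional, dist, t, eps=1e-6):
    """Central-difference approximation to a(t) = d/d(eps) functional(F_0 + eps * delta_t)."""
    plus = dist + [(t, eps)]
    minus = dist + [(t, -eps)]
    return (functional(plus) - functional(minus)) / (2.0 * eps)
```

For this stand-in functional the derivative is available in closed form, a(t) = rho {t1 t2 / mu_11 - t1^2 / (2 mu_20) - t2^2 / (2 mu_02)}, which the numerical value reproduces.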

To complete this section we consider the results of the application of the computer
algebraic manipulation procedure to the functional (3-3) for two special cases. Further
details of the results discussed will be given in § 3-3 below. Let $A_{SB}(F_0)$ be the criterion
obtained from Theorem 2 for the shrunk smoothed bootstrap to be advantageous in the
estimation of the functional $A(F_0)$. Recall that $A_{SB}(F_0) < 0$ means that some smoothing
at least is worthwhile.

Suppose, first, that the distribution of the data can be reduced by an affine transformation
to a radially symmetric distribution $F_1$. Without loss of generality it can be assumed
that $F_1$ has unit marginal variances. Let $R$ be the radial component of $F_1$, and denote
by $s_j$ the $j$th central moment of $R^2$. Computer algebra shows that the criterion $A_{SB}(F_0)$
reduces, in this case, to

$$ A_{SB}(F_0) \beta_0^{-2} = -\{3 s_4 + (4 - 3 s_2) s_3 + s_2^3 + 2 s_2^2 + 24 s_2 + 16\} / 32, \tag{3-5} $$

where $\beta_0$ is a positive quantity depending on $F_0$. Using the standard inequality $s_3^2 \le s_2 s_4$, we have

$$ -32 A_{SB}(F_0) \beta_0^{-2} \ge 3 s_4 - 4 s_4^{1/2} s_2^{1/2} - 3 s_4^{1/2} s_2^{3/2} + s_2^3 + 2 s_2^2 + 24 s_2 + 16 $$
$$ \phantom{-32 A_{SB}(F_0) \beta_0^{-2}} = 3 (s_4^{1/2} - \tfrac{1}{2} s_2^{3/2} - 2 s_2^{1/2} / 3)^2 + \tfrac{1}{4} s_2^3 + 68 s_2 / 3 + 16 \ge 16. $$

This gives the general conclusion that $A_{SB}(F_0) \le -\tfrac{1}{2} \beta_0^2 < 0$ for any distribution $F_0$ which can
be affinely transformed to radial symmetry.
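The expression (3-5) is straightforward to evaluate once the central moments of R^2 are known. The sketch below, with a hypothetical function name, returns the criterion in units of beta_0^2. Two cases can be worked by hand: for the bivariate standard normal, R^2 is exponential with mean 2, so (s_2, s_3, s_4) = (4, 16, 144); for a uniform distribution on the circle of radius sqrt(2), all three central moments vanish and the bound derived above is attained.

```python
def asb_radial(s2, s3, s4):
    """A_SB(F_0) / beta_0^2 from (3-5), given central moments s_j of R^2."""
    return -(3 * s4 + (4 - 3 * s2) * s3 + s2 ** 3 + 2 * s2 ** 2 + 24 * s2 + 16) / 32


print(asb_radial(4, 16, 144))   # bivariate standard normal: -16.0
print(asb_radial(0, 0, 0))      # uniform on a circle of radius sqrt(2): -0.5
```

Both values are negative, in line with the conclusion that some shrunk smoothing is worthwhile for distributions of this type.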

Another class of distributions for which $A_{SB}(F_0)$ is guaranteed not to be positive is
the class for which a particular affine transformation of $F_0$ to unit variance-covariance
matrix yields a distribution with independent marginals. Let $X$ be a random vector with
distribution $F_0$, and let $\sigma_1^2 = \mathrm{var}(X_1)$, $\sigma_2^2 = \mathrm{var}(X_2)$ and $\rho = \mathrm{corr}(X_1, X_2)$. Define a matrix
$S$ by

$$ S = \begin{pmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{pmatrix} \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}^{1/2}; \tag{3-6} $$

here the power $\tfrac{1}{2}$ denotes the symmetric positive-definite square root. Define a bivariate
distribution $F^*$ by $F^*(u) = F_0(Su)$ for all 2-vectors $u$. A random vector $Y = S^{-1} X$ with
distribution $F^*$ and unit variance-covariance matrix can be obtained by first rescaling
the marginals of $X$ to have unit variance and then rescaling the principal components
of the resulting vector to have unit variances. If this natural affine transform of $F_0$ has
independent marginals, then an argument given in § 3-3 below demonstrates that
$A_{SB}(F_0) \le 0$, with equality only if $X$ has a uniform discrete distribution giving probability
$\tfrac{1}{4}$ to each of four points.


The bootstrap: to smooth or not to smooth?

In summary, we have derived the following conclusion. Provided all the approximations
we have made are reasonable, using a shrunk smoothed bootstrap, with an appropriate
smoothing parameter, will give improved estimation of $\alpha_n(F_0)$ over that obtained by the
standard bootstrap, if either $F_0$ is an affine transformation of a radially symmetric
distribution or $F_0$ is an affine transformation, of a particular kind, of a distribution with
independent marginals, and $F_0$ is not a uniform four-point discrete distribution. In
practice the underlying distribution $F_0$ will not be known. An obvious topic for future
investigation is the construction of empirical versions of the criteria of Theorems 1 and
2, on the basis of which a decision whether or not to smooth can be made for each data
set encountered. Some preliminary simulations along these lines have been encouraging.

3-3.  Some  technical  details 

Throughout this section, define the matrix $S$ as in (3-6), and suppose that $X$ is a
random vector with distribution $F_0$. Let $Y = S^{-1} X$ as in § 3-2, and let $F^*(y) = F_0(Sy)$
be the distribution of $Y$. It is easily seen that the existence of an affine transformation
reducing $F_0$ to radial symmetry is equivalent to the radial symmetry of the particular
affine transformation $F^*$.

Define $a_S(u) = a(Su)$ and let $k_{ij} = E(Y_1^i Y_2^j)$, where $Y = S^{-1} X$. In both of the two
special cases considered in § 3-2, $k_{31} = k_{13} = 0$, and computer algebraic manipulation
showed that $a_S(u)$ reduces to the simple form

$$ a_S(u) = \{u_1^2 u_2^2 - k_{22} (u_1^2 + u_2^2)\} \beta_0. $$

The criterion given in Theorem 2 also reduces to a simple form when expressed in
terms of $a_S$. We have, by standard calculus,

$$ a^*(X) = D_V a(X) - X \cdot \nabla a(X) = \nabla^2 a_S(Y) - Y \cdot \nabla a_S(Y) = a_S^*(Y), $$

say, where $a_S^*(u) = \{2 (1 + k_{22}) (u_1^2 + u_2^2) - 4 k_{22} - 4 u_1^2 u_2^2\} \beta_0$.

Since, by definition, $a(X) = a_S(Y)$, it follows that

$$ A_{SB}(F_0) = \mathrm{cov}\{a(X), a^*(X)\} = \mathrm{cov}\{a_S(Y), a_S^*(Y)\} $$
$$ \phantom{A_{SB}(F_0)} = E[\{a_S(Y) + \beta_0 k_{22}\} a_S^*(Y)] \tag{3-7} $$

since it is immediate that $E\{a_S(Y)\} = -\beta_0 k_{22}$.

Suppose, now, that the distribution of $Y$ is radially symmetric, so that
$Y^T = (R \cos \Theta, R \sin \Theta)$ with $\Theta$ uniformly distributed on $[0, 2\pi)$. The form (3-7) for $A_{SB}(F_0)$
can be expressed in terms of even moments of $Y$ up to order 8, and each of these moments
can be expressed in terms of the moments of $R^2$. For example

$$ k_{22} = E(R^4 \sin^2 \Theta \cos^2 \Theta) = E(R^4) / 8 = (s_2 + 4) / 8, $$

where, as in § 3-2, $s_j = E(R^2 - 2)^j$ is the $j$th central moment of $R^2$; the assumption that
$E(Y_1^2) = E(Y_2^2) = 1$ implies that $R^2$ has mean 2. Performing all these substitutions, by
computer algebra, yields the form (3-5) for $A_{SB}(F_0)$ and hence the conclusion given in
§ 3-2 for distributions that can be transformed to radial symmetry.

Now suppose that $Y_1$ and $Y_2$ are independent, but that $Y$ is not necessarily radially
symmetric. It will then be the case that $k_{22} = E(Y_1^2) E(Y_2^2) = 1$ and hence

$$ a_S^*(u) = -4 \beta_0 (u_1^2 u_2^2 - u_1^2 - u_2^2 + 1) = -4 \{a_S(u) + \beta_0\}. $$

It follows that $A_{SB}(F_0) = -4 \, \mathrm{var}\{a_S(Y)\}$. Since $Y_1$ and $Y_2$ are independent, the only way
$\mathrm{var}\{a_S(Y)\}$ can be zero is for $Y$ to have the four-point distribution giving probability $\tfrac{1}{4}$
to each of the points $(\pm 1, \pm 1)$; otherwise $a_S(Y)$ has positive variance, and $A_{SB}(F_0) < 0$.
4.  Simulation  study
The discussion in § 3 above involved heavy dependence on two approximations, one
of them specific to the example under consideration and the other a key feature of our
proposed general methodology. In this section, we investigate both of these approximations
by a simulation study which extends the one carried out by Efron (1982, Table
5.2). All our simulations are carried out under the assumptions of Efron's model, that
$F_0$ is the bivariate normal distribution with unit marginal variances and correlation $\tfrac{1}{2}$.
Efron considered only samples of size 14, though we consider here larger sample sizes
as well. We follow Efron in using the values 0 and $\tfrac{1}{2}$ for the smoothing parameter $h$.

For each sample size $n$, the accuracy of the bootstrap and smoothed bootstrap estimates
of the sampling standard deviation $\alpha_n(F_0)$ of the variance-stabilized correlation coefficient
was assessed in three different ways. First, a direct simulation of the bootstrap procedure
itself was carried out; two hundred data sets were generated from $F_0$ and for each one
$\alpha_n(F_0)$ was estimated by the usual resampling procedure, using two hundred resampled
data sets of size $n$ in each case. The true value of $\alpha_n(F_0)$ is known and so it is possible
to estimate the root mean squared error of the direct bootstrap procedures from our
simulations. The results thus obtained are labelled 'direct' in Table 1.
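The inner step of the direct simulation, estimating the sampling standard deviation of z from a single data set by resampling, might look as follows. This is a sketch under stated assumptions rather than the code used for Table 1: it uses a bivariate normal kernel, takes V to be the sample variance matrix, and applies the shrinkage of (1-2); the function names are hypothetical. Setting h = 0 gives the standard bootstrap.

```python
import math
import random


def correlation(pairs):
    """Sample correlation coefficient r of a list of (x1, x2) pairs."""
    n = len(pairs)
    m1 = sum(p[0] for p in pairs) / n
    m2 = sum(p[1] for p in pairs) / n
    s11 = sum((p[0] - m1) ** 2 for p in pairs)
    s22 = sum((p[1] - m2) ** 2 for p in pairs)
    s12 = sum((p[0] - m1) * (p[1] - m2) for p in pairs)
    return s12 / math.sqrt(s11 * s22)


def bootstrap_sd_z(pairs, h=0.0, n_boot=200, rng=None):
    """Shrunk smoothed bootstrap estimate of the standard deviation of z = atanh(r)."""
    rng = random.Random() if rng is None else rng
    n = len(pairs)
    m1 = sum(p[0] for p in pairs) / n
    m2 = sum(p[1] for p in pairs) / n
    v11 = sum((p[0] - m1) ** 2 for p in pairs) / (n - 1)
    v22 = sum((p[1] - m2) ** 2 for p in pairs) / (n - 1)
    v12 = sum((p[0] - m1) * (p[1] - m2) for p in pairs) / (n - 1)
    # Cholesky factor of V; any root of V serves because the normal kernel is
    # spherically symmetric.
    l11 = math.sqrt(v11)
    l21 = v12 / l11
    l22 = math.sqrt(v22 - l21 * l21)
    # r is unchanged by this common rescaling; it is kept only to mirror (1-2).
    shrink = 1.0 / math.sqrt(1.0 + h * h)
    zs = []
    for _ in range(n_boot):
        star = []
        for _ in range(n):
            x1, x2 = rng.choice(pairs)
            e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
            star.append((shrink * (x1 + h * l11 * e1),
                         shrink * (x2 + h * (l21 * e1 + l22 * e2))))
        # A resample with |r| = 1 would fail here; for continuous data this is
        # possible only when h = 0 and is then extremely unlikely.
        zs.append(math.atanh(correlation(star)))
    zbar = sum(zs) / n_boot
    return math.sqrt(sum((z - zbar) ** 2 for z in zs) / (n_boot - 1))
```

Repeating this over many data sets generated from the assumed model, and comparing the estimates with the known true value, would give a 'direct' estimate of root mean squared error of the kind reported in the table.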

Table 1. Estimates of root mean squared errors of bootstrap estimates of sampling standard
deviations of variance-stabilized and untransformed correlation coefficients; sample sizes n
and smoothing parameters h.

                  Variance-stabilized            Untransformed
   n     h      Direct   Linear   Delta      Direct   Linear   Delta

  14     0      0.075    0.071    0.077      0.070    0.076    0.060
        1/2     0.045    0.046    0.057      0.057    0.055    0.052
  20     0      0.049    0.050    0.053      0.046    0.053    0.044
        1/2     0.033    0.032    0.037      0.045    0.039    0.041
  30     0      0.029    0.033    0.033      0.033    0.036    0.030
        1/2     0.019    0.021    0.022      0.027    0.026    0.027
  40     0      0.024    0.025    0.025      0.024    0.027    0.027
        1/2     0.015    0.016    0.017      0.021    0.019    0.020
  50     0      0.020    0.020    0.021      0.020    0.021    0.019
        1/2     0.013    0.013    0.014      0.019    0.015    0.018
 100     0      0.011    0.010    0.010      0.010    0.011    0.010
        1/2     0.008    0.006    0.007      0.009    0.008    0.008


Secondly, in order to investigate the accuracy of our linear approximation $A_{h,s}(F_0)$,
some analytic calculations were carried out, making use of computer algebra. By this
means, the behaviour of the approximation can be studied without recourse to any
simulation. For the bivariate normal population under consideration, the standard deviation
of $A_{h,s}(F_0)$ was found to be $n^{-1} (1 + h^2)^{-2}$. This quantity is referred to as the 'linear'
estimate of the root mean squared error of the bootstrap procedure. Closeness of the
'linear' and 'direct' estimates of root mean squared error would vindicate our proposed
procedure of studying the sampling properties of the bootstrap by means of linear
approximations.

Our development of the linear approximation involved the intermediate step of
approximating $\alpha_n(F_0)$ by $n^{-1/2} \alpha(F_0)$, as given in (3-3). This intermediate approximation
raises the possibility of studying the sampling properties of the smoothed bootstrap by
considering those of the approximation (3-3), with $F_0$ replaced by $\hat{F}_{h,s}$. This corresponds
to substituting the moments of $\hat{F}_{h,s}$, which are easily calculated from the sample, into
(3-3). By analogy with § 6.5 of Efron (1982), we refer to this procedure as the nonparametric
delta approximation to the smoothed bootstrap. For each of two hundred simulated
samples from $F_0$ this approximation was calculated. From the values thus obtained, a
third estimate of the root mean squared error of the smoothed bootstrap procedure was
found. This is labelled 'delta' in Table 1.

The analogous investigation was carried out for the untransformed correlation
coefficient $r$, in the context of the same bivariate normal model. The factor involving $(1 - \rho^2)$ is
omitted from (3-3) in this case; otherwise the same algebraic manipulations and simulations
were performed as for the variance-stabilized coefficient. The 'linear' estimate of
the root mean squared error is now $\tfrac{3}{4} n^{-1} (1 + h^2)^{-2} (2 + 2h^2 + h^4)^{1/2}$. The results are presented
in the last three columns of Table 1.
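Because the 'linear' estimates are closed-form expressions, the corresponding columns of Table 1 can be recomputed in a few lines. The sketch below uses hypothetical function names. The leading factor 3/4 in the untransformed case is a reading of a damaged passage in the source; it equals 1 - rho^2 for rho = 1/2 and agrees with the tabulated entries to within a unit in the last place, but it should be treated as an assumption.

```python
import math


def linear_rmse_stabilized(n, h):
    """'Linear' estimate for the variance-stabilized coefficient: n^-1 (1 + h^2)^-2."""
    return 1.0 / (n * (1.0 + h * h) ** 2)


def linear_rmse_untransformed(n, h):
    """'Linear' estimate for r itself, with the assumed leading factor 3/4."""
    return 0.75 * math.sqrt(2.0 + 2.0 * h ** 2 + h ** 4) / (n * (1.0 + h * h) ** 2)


for n in (14, 20, 30, 40, 50, 100):
    for h in (0.0, 0.5):
        print(n, h,
              round(linear_rmse_stabilized(n, h), 3),
              round(linear_rmse_untransformed(n, h), 3))
```

For n = 14 this gives 0.071 and 0.046 for the variance-stabilized coefficient at h = 0 and h = 1/2, and 0.076 and 0.055 for the untransformed coefficient, matching the first block of the table.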

The broad conclusions to be drawn from Table 1 are the same for both correlation
coefficients. Even for the small sample size considered by Efron (1982), our linear
approximation procedure gives good estimates of the accuracy of the full bootstrap
procedure, and the relative improvement due to smoothing is well predicted. Efron's
conclusions could have been obtained without recourse to any simulation. On the whole
the delta procedure, which itself involves some simulation, gives slightly inferior estimates
of the bootstrap's accuracy.

It is known (Davison, Hinkley & Schechtman, 1986) that the variance-stabilized
correlation coefficient is highly correlated with its linear approximation, but the untransformed
correlation coefficient is in general not. The suspicion expressed by a referee that
this may have a deleterious effect on our approximations in the untransformed case does
not appear to have been borne out by the simulation study, except that the beneficial
effects of smoothing the bootstrap were systematically slightly exaggerated by the linear
method in this case.
SUMMARY

We consider the problem of reconstructing an image from a noisy record. We
describe existing methods due to Geman and Geman (1984) and Besag (1986) which use
a Markov random field model for the true scene but assume that each pixel consists of a
single colour. In order to improve the quality of the restoration at the boundary of
regions of different colours we extend these methods to allow pixels to contain two
regions of colour separated by a single straight line. An algorithm for performing the
reconstruction is presented and illustrated by an example.

INTRODUCTION 

We consider a rectangular region partitioned into pixels labelled $1, 2, \ldots, n$. Each
pixel is coloured black or white and the colour of pixel $i$ is denoted by $x_i$ which takes
the value 0 for white and 1 for black. The $x_i$ are unobserved. We work instead from the
observed record $y_i$ which consists of $x_i$ plus added noise. We denote the whole scene by
$x = \{x_i ; i = 1, \ldots, n\}$ and the set of records by $y = \{y_i ; i = 1, \ldots, n\}$. The noise
distribution will be assumed to be known but if this were not the case, it could be
established by studying training data.

Recent developments in statistical restoration methods use a Bayesian approach.
The maximum a posteriori (MAP) estimate of the true scene is the value of $x$ which
maximises $P(x | y)$, the conditional probability of $x$ given the record $y$. By Bayes'
theorem

$$ P(x | y) \propto l(y | x) \, p(x), \tag{1} $$

where $l(y | x)$ is the conditional likelihood of the observed record, $y$, given the true
colouring, $x$, and $p(x)$ is the prior probability of $x$.


We assume the conditional density function $f(y_i | x_i)$ to be known and for the
remainder of this paper we shall assume that the records, $y_i$, are independently
distributed as Gaussian with mean $x_i$ and variance $\sigma^2$. Thus,

$$ l(y | x) = \prod_{i=1}^{n} f(y_i | x_i) = (2 \pi \sigma^2)^{-n/2} \exp\Bigl\{ -\frac{1}{2 \sigma^2} \sum_{i=1}^{n} (y_i - x_i)^2 \Bigr\}. $$

To obtain a valid formula for $p(x)$, we assume that the true scene corresponds to a
locally dependent Markov random field (MRF) with respect to a specified neighbourhood
system, that is, the conditional distribution of pixel $i$ given the colourings of all other
pixels depends only on the neighbours of pixel $i$. We shall use a second order
neighbourhood system in which pixels are considered to be neighbours if they are
horizontally, vertically or diagonally adjacent to each other. A detailed definition and
further examples of Markov random fields may be found in Besag (1974).

The form of $p(x)$ is determined by the nature of the Markov random field. In our
case, we have

\[ p(x) \propto e^{-\beta Z(x)}, \]

where $Z(x)$ is the number of discrepant pairs in the scene $x$, i.e. the number of pairs of
neighbours which are of opposite colour, and $\beta$ is a fixed positive constant (normally
chosen to be between 0.5 and 1.5).

The maximisation of $P(x \mid y)$ now corresponds to the minimisation of

\[ \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - x_i)^2 + \beta Z(x) \tag{2} \]

over values of $x = \{x_i;\ i = 1, \ldots, n\}$.

This  expression  may  be  regarded  as  a  penalty,  the  first  term  penalising  any 
difference  between  the  record  and  the  fitted  value,  the  second  term  penalising  excessive 
roughness in the reconstruction. Clearly, with $2^n$ possible values for $x$ this is a
computationally large problem and necessitates the use of a sophisticated algorithm.
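
As a concrete illustration of penalty (2), the following minimal Python sketch evaluates it for a small scene stored as a list of rows. The function names `discrepant_pairs` and `penalty` and the list-of-lists layout are assumptions made purely for illustration; discrepant pairs are counted under the second order neighbourhood system described above.

```python
from typing import List

# Offsets that visit each unordered second-order neighbour pair exactly once:
# right, down, down-right and down-left.
FORWARD_OFFSETS = [(0, 1), (1, 0), (1, 1), (1, -1)]


def discrepant_pairs(x: List[List[int]]) -> int:
    """Z(x): number of neighbouring pixel pairs of opposite colour."""
    rows, cols = len(x), len(x[0])
    count = 0
    for r in range(rows):
        for c in range(cols):
            for dr, dc in FORWARD_OFFSETS:
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols and x[r][c] != x[rr][cc]:
                    count += 1
    return count


def penalty(x, y, sigma, beta):
    """Penalty (2): scaled squared misfit plus beta times Z(x)."""
    misfit = sum(
        (y_rc - x_rc) ** 2
        for x_row, y_row in zip(x, y)
        for x_rc, y_rc in zip(x_row, y_row)
    )
    return misfit / (2.0 * sigma ** 2) + beta * discrepant_pairs(x)
```

Since there are $2^n$ candidate scenes, exhaustive evaluation of such a function is feasible only for toy examples.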

Geman and Geman (1984) use the method of simulated annealing which attempts to
find the MAP estimate of $x$ given the record $y$. Their method is computationally
extravagant and more recent developments by Greig, Porteous and Seheult (1986) show
that the MAP estimate of any two colour scene may be found exactly using the
Ford-Fulkerson labelling algorithm for maximising flow through a network.

Besag (1986) proposed the computationally simpler method of iterated conditional
modes (ICM) which updates each pixel in turn, choosing for it the most likely colour
based on its record and the current colouring of its neighbours. In updating pixel $i$ the
new $x_i$ is chosen to minimise the sum of terms involving $x_i$ in the penalty (2), i.e.

\[ \frac{1}{2\sigma^2} (y_i - x_i)^2 + \beta Z(x_i), \]

where $Z(x_i)$ is the number of neighbours of pixel $i$ in the current restoration which are
of the opposite colour to $x_i$. The method proceeds by scanning the scene, successively
updating each pixel until convergence is reached. This will normally occur at a local
rather than global maximum of $P(x \mid y)$, but, given the possibility of undesirable long
range  dependencies  in  the  MRF  model,  this  is  not  a  serious  drawback  and  might  even  be 
an  advantage. 
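
The ICM update just described can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names are hypothetical, the scene is stored as a list of rows, pixels are visited in raster order, and the starting colouring (each pixel set to the nearer of the two means) is our own choice rather than something specified above.

```python
def opposite_neighbours(x, r, c, colour):
    """Number of second-order neighbours of (r, c) whose colour differs from `colour`."""
    rows, cols = len(x), len(x[0])
    count = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if (dr, dc) == (0, 0):
                continue
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols and x[rr][cc] != colour:
                count += 1
    return count


def icm(y, sigma, beta, max_sweeps=50):
    """Iterated conditional modes for a two colour scene with Gaussian records."""
    rows, cols = len(y), len(y[0])
    # Assumed starting point: colour each pixel by the nearer of the two means.
    x = [[1 if y[r][c] > 0.5 else 0 for c in range(cols)] for r in range(rows)]
    for _ in range(max_sweeps):
        changed = False
        for r in range(rows):
            for c in range(cols):
                current = x[r][c]
                other = 1 - current
                cost = {
                    colour: (y[r][c] - colour) ** 2 / (2.0 * sigma ** 2)
                    + beta * opposite_neighbours(x, r, c, colour)
                    for colour in (0, 1)
                }
                # Switch only on a strict improvement, so ties keep the current colour.
                if cost[other] < cost[current]:
                    x[r][c] = other
                    changed = True
        if not changed:
            break
    return x
```

For example, a 5 x 5 record that is zero everywhere except for a value of 0.6 at the centre is restored to an all-white scene when $\beta = 1$ and $\sigma = 0.5$, whereas with $\beta = 0$ the centre pixel stays black.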


SPLIT  PIXELS 


So far we have considered scenes in which each pixel is coloured wholly one
colour. We now allow pixels in the true scene to be coloured partly black and partly
white. Each record $y_i$ is distributed as Gaussian with variance $\sigma^2$ and mean $p_i$, the
proportion of pixel $i$ which is coloured black. The restoration methods that we have
previously discussed can be used for this problem by proceeding as if the pixels were
only of one colour but the quality of the restoration at the edges of objects or regions
will obviously be poor. Instead, we can allow pixels in the restored image to be
coloured partly black and partly white. The simplest form of this is to quarter each pixel
and allow it to be filled with the most likely of the $2^4$ configurations. This method,
proposed by Jennison (1986), uses a modified version of ICM, firstly iterating at full
pixel size and subsequently restoring the quarters; in the second stage the same form of
MRF model is used for the subpixels as is originally used for full pixels. This method
appears to work well and has prompted work into the further breakdown of pixels.

For further refinement we can either (i) consider an $m \times m$ breakdown of each pixel
or (ii) use continuous lines within the pixel to represent the edge. The implementation
of (i) requires the minimisation of

\[ \frac{1}{2\sigma^2} \sum_{i=1}^{n} \Bigl( y_i - \frac{1}{m^2} \sum_{j=1}^{m} \sum_{k=1}^{m} x_{ijk} \Bigr)^2 + \tfrac{1}{2} \beta \sum_{i=1}^{n} \sum_{j=1}^{m} \sum_{k=1}^{m} Z(x_{ijk}), \]

where the subscript $ijk$ refers to subpixel $(j, k)$ within pixel $i$; $x_{ijk}$ takes value 0 or 1 and
$Z(x_{ijk})$ is the number of subpixel neighbours of subpixel $ijk$ in the current restoration
which are of the opposite colour to $x_{ijk}$ (the factor $\tfrac{1}{2}$ is needed as each discordant pair is
counted twice). Note that subpixels at the edge of a pixel will have some subpixel
neighbours contained in an adjacent pixel. We can see that as $m$ increases this
minimisation becomes computationally cumbersome. Also, it offers only an
approximation to (ii) and it turns out to be easier to pass to the limit and work directly
with continuous solutions.


The most basic form of (ii) allows a single straight line edge within each pixel and
it is the implementation of this that we shall describe. It is no longer meaningful to talk
of discrepant pixel or subpixel pairs and we replace the second term of (2) by a multiple
of the total length of edge in the reconstruction $x$. Thus, the restored image is chosen to
minimise

\[ \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - p_i(x))^2 + \beta' L(x) \tag{3} \]

over images $x$ made up of pixels $i = 1, \ldots, n$, each either of a single colour or divided into
two regions of different colours by a single straight line; $p_i(x)$ denotes the proportion of
black in pixel $i$; $L(x)$ is the total edge length in scene $x$ and $\beta'$ is a fixed constant related
to the $\beta$ used earlier.

An advantage of edge length as a measure is that the penalty is rotationally
invariant, i.e. remains constant throughout all rotations of the scene within the region.
This could not be obtained using discrepant pairs as a measure although it has been
shown by our colleague Robin Sibson that this variability can be minimised using a
downweighting of $1/\sqrt{2}$ for the diagonal adjacencies.

THE  RESTORATION  ALGORITHM 

The restoration is done in three stages, the first two of which have already been
described:

Stage 1: ICM to convergence on full size pixel grid.

Stage 2: ICM to convergence on 2x2 pixel grid.

Stage 3: Updating process on the line segments representing the edges.

Stage 3 requires that we now regard the reconstruction as a series of line segments
separating the two colours. An initial representation is obtained in a straightforward way
from the end product of Stage 2. The updating process treats pixels in pairs, selecting
the best place for two edges to meet, given the current restoration of neighbouring
pixels.

As an example, consider the configuration at pixels $i$ and $j$ shown in Figure 1. The
distances $a$ and $b$ are determined by the current colouring of neighbouring pixels and
treated as constant for the moment. The distance $W$ is chosen to minimise the
contribution from pixels $i$ and $j$ to the total penalty (3), i.e.

\[ g(W) = \frac{1}{2\sigma^2} \sum_{k=i,j} (y_k - p_{kW})^2 + \beta' (e_{iW} + e_{jW}), \tag{4} \]

where $e_{kW}$ is the length of edge in pixel $k$ when the join is at $W$ and $p_{kW}$ is the
proportion of black in pixel $k$ when the join is at $W$.

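For the case shown in Figure 1, this penalty is

\[ g_1(W) = \frac{1}{2\sigma^2} \left\{ \bigl(y_i - a - \tfrac{1}{2}(W-a)\bigr)^2 + \bigl(y_j - b - \tfrac{1}{2}(W-b)\bigr)^2 \right\} + \beta' \left\{ \sqrt{1+(W-a)^2} + \sqrt{1+(W-b)^2} \right\}. \]

This cannot be minimised directly but the form of

\[ \frac{dg_1}{dW} = -\frac{1}{2\sigma^2} \left\{ \bigl(y_i - a - \tfrac{1}{2}(W-a)\bigr) + \bigl(y_j - b - \tfrac{1}{2}(W-b)\bigr) \right\} + \beta' \left\{ \frac{W-a}{\sqrt{1+(W-a)^2}} + \frac{W-b}{\sqrt{1+(W-b)^2}} \right\} \]

suggests an iterative approach. Given an approximate solution $W_{j-1}$ (the subscript on $W$
indexes the iteration, not a pixel) we solve

\[ -\frac{1}{2\sigma^2} \left\{ \bigl(y_i - a - \tfrac{1}{2}(W_j-a)\bigr) + \bigl(y_j - b - \tfrac{1}{2}(W_j-b)\bigr) \right\} + \beta' \left\{ \frac{W_j-a}{\sqrt{1+(W_{j-1}-a)^2}} + \frac{W_j-b}{\sqrt{1+(W_{j-1}-b)^2}} \right\} = 0 \]

to obtain the next approximation $W_j$.

Starting from any sensible initial value, $W_0$, accuracy to 3 decimal places was
achieved after at most four iterations. In practice we take $W_0$ to be the value of $W$ prior
to this update.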
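
The iteration for the join position can be sketched as follows. This is a minimal sketch under assumptions of our own: pixels are taken to have unit width, as the form of the penalty for the Figure 1 case implies; the closed form used for each step is our rearrangement of the condition that the derivative vanishes when the square roots are held at their values for the previous iterate; and the function names `g1_derivative` and `update_join` are hypothetical.

```python
import math


def g1_derivative(w, y_i, y_j, a, b, sigma, beta_prime):
    """Derivative of the Figure 1 penalty with respect to W (unit pixel width assumed)."""
    data = -((y_i - a - 0.5 * (w - a)) + (y_j - b - 0.5 * (w - b))) / (2.0 * sigma ** 2)
    edge = (w - a) / math.sqrt(1.0 + (w - a) ** 2) + (w - b) / math.sqrt(1.0 + (w - b) ** 2)
    return data + beta_prime * edge


def update_join(y_i, y_j, a, b, sigma, beta_prime, w0, tol=1e-3, max_iter=50):
    """Iterate on W, freezing the square roots at the previous iterate."""
    lam = 1.0 / (2.0 * sigma ** 2)
    w = w0
    for _ in range(max_iter):
        root_a = math.sqrt(1.0 + (w - a) ** 2)
        root_b = math.sqrt(1.0 + (w - b) ** 2)
        # With the roots frozen the derivative is linear in W; this is its zero.
        numerator = lam * (y_i + y_j - 0.5 * (a + b)) + beta_prime * (a / root_a + b / root_b)
        denominator = lam + beta_prime * (1.0 / root_a + 1.0 / root_b)
        w_new = numerator / denominator
        if abs(w_new - w) < tol:
            return w_new
        w = w_new
    return w
```

In a symmetric configuration such as $y_i = 0.45$, $y_j = 0.55$, $a = 0.3$, $b = 0.7$ the iteration settles at $W = 0.5$, where both the fit term and the edge length term are balanced.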

Different forms of (4) are possible depending on which neighbours of pixels $i$ and $j$
contain both colours. There are only four distinct cases that may arise and these are
shown in Figure 2.


Fig. 2. Possible configurations of edges in two neighbouring pixels.


We have shown the method of solution for case (i) and cases (ii) - (iv) are solved
in a similar way. All other cases can be reduced to one of the above by means of
exchanging and/or inverting the pixels and their colours.

The most natural order of updating the edge pixels would seem to be to follow an
edge around, updating each join in turn, completing circuits of the edge until
convergence. An alternative method is to update every $k$th join around the circuit,
therefore completing $k$ laps before each pixel has been updated once. Initial results
suggest  that  this  provides  additional  stability  in  the  updating  process;  we  have  found  the 
value  k  =  3  to  give  particularly  good  results. 
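
One possible reading of the "every $k$th join" schedule is sketched below. The function name `join_update_order` and the indexing of joins as 0, 1, ..., n-1 around a closed circuit are assumptions made for illustration; other conventions for handling a circuit whose length is not a multiple of $k$ are equally plausible.

```python
def join_update_order(n_joins, k):
    """Indices of joins in the order visited when every k-th join is updated.

    Lap `start` (0 <= start < k) visits joins start, start + k, start + 2k, ...
    so that after k laps every join has been updated exactly once.
    """
    return [j for start in range(k) for j in range(start, n_joins, k)]
```

With seven joins and $k = 3$ this gives the order 0, 3, 6, 1, 4, 2, 5, while $k = 1$ reduces to the natural order.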

AN  EXAMPLE 

We  illustrate  the  methods  we  have  described  with  an  artificial  example.  Figure  3a 
shows  a  true  image  and  the  superimposed  pixel  grid.  The  record  from  which  a  restored 
image  was  constructed  was  obtained  by  generating  a  Gaussian  random  variable  for  each 
pixel  with  mean  equal  to  the  proportion  of  the  pixel  coloured  black  in  the  true  image  and 
variance  0.01”.  Figure  3b  is  the  reconstruction  after  stage  1,  in  which  the  ICM  method 
with $\beta = 1$ has been used, treating each pixel as either completely black or completely
white.  Note  that  this  is  a  rather  poor  approximation  to  the  true  image  but  it  is  the  best 
that  can  be  done  without  dividing  pixels.  Subdividing  each  pixel  into  four  in  stage  2 
produces  the  reconstruction  in  Figure  3c:  the  amounts  of  black  in  each  full  pixel  are  now 
much  closer  to  the  corresponding  records  and  the  divisions  of  split  pixels  match  up  well 
with the true image. Proceeding to stage 3, we found that using $\beta' = 50$ gave better
results than those obtained using lower values of $\beta'$. The final reconstruction is shown
in  Figure  3d.  Despite  the  coarseness  of  the  original  pixel  grid  and  the  addition  of  noise 
to  the  record,  this  reconstruction  is  barely  distinguishable  from  the  true  image. 


Fig  3a  True  image 


Fig  3b  Reconstruction  after  stage  1 


Fig  3c  Reconstruction  after  stage  2  Fig  3d  Final  reconstruction 

FURTHER  EXTENSIONS 

(a)  Consider  a  pixel  which  has  true  colouring  as  shown  in  Figure  4.  Clearly  the 
straight  line  approximation  to  this  edge  will  be  poor  and  could  have  an  adverse  effect  on 
the reconstruction of neighbouring pixels and pixels further along the edge. This may be
overcome  using  a  more  intricate  restoration  method,  e.g.  allowing  two  straight  lines 
meeting  at  some  point  within  a  pixel. 


Fig. 4. A pixel containing a boundary that cannot be approximated well by a single straight line.


(b) The method presented in this paper can be extended to scenes containing more
than two different colours. Where any two regions meet we can adjust the algorithm to
provide a continuous line join. More computation is required to find the best colouring
for a pixel in which three or more regions meet.

Abstract

One of the ingredients of recent methodology in statistical image restoration is the
idea of introducing a system of "edges" between pixels in the image. If an edge is
present between two contiguous pixels then they are not considered as neighbours in
the restoration procedure. In penalized maximum likelihood estimation of the image,
the number and configuration of the edges is controlled by a penalty term; in
model-based restoration using random fields there is an analogous penalty term in the energy
function of the Gibbs distribution for the edge process. In this paper we show how
some geometrical insights can be used to provide penalties for the various edge
configurations in a way that is roughly independent of the pixel discretisation. The
penalties obtained are consistent over pixels of different sizes, shapes and orientations,
even if these occur in the same pattern. The cases of square, rectangular, hexagonal
and irregular pixels are considered.

I. Introduction


Geman and Geman [2] discussed a methodology for pixel image restoration
which  depended  on  the  idea  of  modelling  the  true  image  by  a  Markov  random  field. 
A  key  feature  of  their  approach  was  the  possible  placing  of  "edge  elements"  at  "line 
sites"  between  pixels  of  the  image.  Although  the  idea  of  introducing  an  edge  process 
was  introduced  in  the  Markov  random  field  context,  its  applicability  is  by  no  means 
confined  to  model-based  methods  of  image  restoration  and  it  is  important  that  the 
construction  of  the  process  should  be  given  careful  consideration. 

The  edge  process  idea  corresponds  to  the  notion  that  the  image  is  segmented  into 
regions  over  each  of  which  its  behaviour  is  relatively  homogeneous,  or  at  least  is  not 
subject  to  abrupt  changes;  from  one  region  to  another,  however,  large  differences  in 
behaviour  are  possible.  The  changes  in  behaviour  may  relate  either  to  overall  grey 
level  or  colour,  or  to  more  subtle  properties  such  as  texture.  Of  course,  the  basic 
motivation  for  this  kind  of  segmentation  of  the  image  is  that  the  true  scene  is  itself 
segmented  into  regions,  and  the  edge  process  in  the  model  is  an  attempt  to 
approximate  boundaries  that  are  present  in  the  true  scene.  For  example,  in  the  context 
of  remote  sensing  of  a  rural  area,  the  boundaries  would  correspond  to  topographic 
features  like  rivers  and  field  boundaries.  Our  aim  in  this  paper  is  to  investigate  the 
consequences  of  thinking  of  the  edge  process  as  being  a  discretised  version  of  an 
underlying  "true"  pattern  of  boundaries.  In  particular  we  are  interested  in  the 
calculation  of  quantitative  summaries  of  the  discretised  edge  process  that  have  genuine 
meaning in terms of properties of the underlying boundary pattern, for example the
total  boundary  length  and  the  complexity  of  the  pattern  of  regions  defined  by  the 
boundaries. 

In  Geman  and  Geman’s  approach,  a  prior  distribution  for  the  true  image  is 
constructed  by  first  constructing  a  prior  Gibbs  distribution  for  the  process  of  edge 
elements  and  then  specifying  the  prior  for  the  pixels  themselves  conditional  on  the 
edge process. In the specification of the pixel process, contiguous pixels separated by
a line site at which an edge element is actually present are not considered as
neighbours,  and  so  are  allowed  to  have  quite  different  grey  levels  without  incurring 
any  penalty  in  the  prior  likelihood. 

An alternative approach in which an edge process is equally important is
penalised maximum likelihood; for background reading see, for example, [4]. In the
image analysis context, the image is considered as a high-dimensional unknown
parameter, and a penalised log likelihood is constructed by subtracting from the log
likelihood of the image given the observed data a penalty term based on the
"dirtyness" of the image. The idea of penalised likelihood is that there are two
conflicting aims in image restoration; one is to obtain a faithful fit to the data, as
measured by the likelihood, while the other is to obtain a "clean" image, corresponding
to a low value of the penalty. For reasons we shall discuss in Section 2 below, a
convenient penalty to use is the energy function of the prior Gibbs distribution for the
image considered as a realisation from a random prior process. In that case the
method of maximum a posteriori estimation as proposed in [2] yields exactly the same
restored image as the penalised maximum likelihood approach, even though the
philosophy behind the two approaches is different.

In  this  paper  we  shall  focus  attention  on  the  specification  of  a  suitable  penalty  for 
the  edge  process.  We  shall  show  how  various  geometrical  insights  suggest  how  such  a 
penalty  should  be  constructed.  Our  discussion  will  suggest  relative  costs  for  possible 
configurations  somewhat  different  from  those  proposed  by  Geman  and  Geman  [2].  In 
addition  our  scheme  will  provide  methods  for  dealing  with  rectangular,  hexagonal  and 
irregular  pixel  patterns. 

For  any  given  penalty  function  the  Gibbs  distribution  with  energy  equal  to  the 
penalty  defines  a  stochastic  model  for  the  edge  process.  However,  we  stress  that  our 
interest  is  in  developing  the  penalty  for  use  in  image  restoration  algorithms,  rather  than 
in  studying  the  theory  of  stochastic  models  for  the  edge  process.  Apart  from  our 
intended  application  to  image  restoration,  the  problem  of  estimating  the  underlying 
edge  length  for  a  discretized  image  is  of  interest  in  its  own  right;  see,  for  example, 
Dorst and Smeulders [1].

II. Locally Based Penalty Functions

The Gibbs distribution approach constructs a prior likelihood for the edge process
by first defining a set of cliques of line sites. Each clique $C$ consists of a small set of
sites; in the Geman and Geman paper the cliques are the collections of four line sites
with a common vertex. The Gibbs model then gives as the prior probability of any
configuration $\omega$

\[ \pi(\omega) = Z^{-1} \exp\{-U(\omega)\}, \]

where $Z$ is a constant and the energy function $U$ satisfies

\[ U(\omega) = \sum_{\text{cliques } C} V_C(\omega). \]

Each $V_C$ is a function which depends only on those elements of $\omega$ that correspond to
sites in the clique $C$. Each clique consists of a set of sites all of which are
"neighbours" of one another in some suitable sense, and hence the energy function
$U(\omega)$ can be constructed by looking at cliques individually; looking ahead to the
prospect of large scale parallel processing, this localisation property is likely to be of
extreme importance in the future.

In practical applications the observed record often consists of the true image
observed indirectly and subject to the addition of random noise. Maximum a
posteriori likelihood estimation of the underlying true image is achieved by
maximizing over possible images the likelihood of the observed record given a true
image multiplied by the prior probability of that image. Equivalently, one maximizes
the sum of the log prior likelihood $-U(\omega)$ and the log likelihood of the record given
the true image $\omega$.

The philosophical approach we shall follow is to consider $U(\omega)$ not directly as a
prior negative log likelihood, but rather as a penalty function for a given configuration
$\omega$. The penalty function is subtracted from the log likelihood of the record given the
true image to give a penalised log likelihood, maximisation of which corresponds to
maximum a posteriori estimation in the Bayesian model. Compare the spline
smoothing approach to nonparametric regression where the penalty term $\int g''^2$ can be
considered either as a direct "roughness penalty" or as a term in a prior log likelihood;
for bibliography on spline smoothing, see, for example, Silverman [5]. As already
mentioned above, the use of a "locally computable" penalty like $U(\omega)$ has enormous
potential advantages in an array processing computer environment, and it is on such
penalties that we shall concentrate in this paper.

It  is  implicitly  assumed  in  the  usual  restoration  methods  that  each  pixel  of  the 
true  image  consists  of  a  single  grey  level  and  edges  of  regions  or  objects  lie  along 
pixel  boundaries.  It  is  more  realistic  to  assume  the  existence  of  a  real  image  in  the 
plane  not  necessarily  related  to  any  pixel  grid;  the  so  called  "true"  pixel  image  is  then 
a  discretization  of  the  real  image.  We  shall  assume  in  addition  that  the  real  image  is 
made  up  of  a  number  of  regions  divided  by  boundaries  and  that  the  edge  process  in 
the  "true"  pixel  image  is  constructed  to  approximate  the  real  boundaries  as  closely  as 
possible. Our approach is to attempt to specify the form of the function $V_C$ in such a
way that the penalty associated with a pixel edge process in the "true" pixel image
will, as far as possible, not depend on the way the pixels are constructed or placed on
the real image; instead the value of the penalty $U(\omega)$ will give a cost based at least
approximately  on  the  real  underlying  boundary  pattern.  Particular  concerns  will  be  to 
eliminate,  as  far  as  possible,  the  effect  of  the  position  and  orientation  of  a  square 
lattice;  to  discuss  how  to  modify  the  penalty  if  the  lattice  is  refined;  and  to  devise 
appropriate  penalties  for  irregular  pixel  patterns.  The  penalty  we  shall  use  will  have  as 
one  ingredient  an  estimate  of  the  total  boundary  length  in  the  underlying  "real"  image, 
and so has relevance to the problem discussed in [1].

III. Square Lattices

Let us turn first to the case of the square lattice, considered by Geman and
Geman [2]. Suppose that the gauge of the lattice is $h$, and that each clique consists of
the four line-sites meeting at a particular vertex. The possible configurations and the
costs ascribed to these configurations by Geman and Geman [2] are shown in Figure
3.1.

Figure 3.1: Possible types of configuration for regular edge process,
and the costs ascribed to them by Geman and Geman [2].


Note  that  the  low  cost  of  a  continuation  relative  to  the  cost  of  an  ending,  branch 
or  crossing  is  intended  to  favour  a  small  number  of  long  straight  edges  over  complex, 
meandering  edge  systems.  However,  we  shall  see  that  this  fails  to  provide  an  adequate 
treatment  of  long  straight  edges  at  orientations  away  from  the  horizontal  and  vertical. 
We shall write $v_t$ for the cost of a configuration of type $t$, and explore the
consequences of various choices of $v_t$.

A.  Boundary  Length  Considerations 

Consider, now, the cost of a very simple pattern, consisting of an infinitely long
straight line placed at angle $\theta$ to one of the edge directions of the lattice; without loss
of generality $0 \le \theta \le \pi/4$. The discretization will replace the line by a stepped pattern
of the form shown in Figure 3.2.


Because $\theta \le \pi/4$, vertical segments will always be separated by one or more horizontal
segments. Over a long distance $L$ in the $x$-direction, the number $n_x$ of horizontal
segments will be asymptotically $Lh^{-1}$, and the number $n_y$ of vertical segments
asymptotically $Lh^{-1}\tan\theta$. The number of configurations of type 2 will be $2n_y$ and the
number of type 3 will be $n_x - n_y$. Thus the total cost will be

\[ 2 n_y v_2 + (n_x - n_y) v_3 \approx L h^{-1} \{ v_3 + (2 v_2 - v_3) \tan\theta \}. \]

The total length of underlying boundary is $L \sec\theta$, and so the cost $c(\theta)$ per unit length
of underlying boundary is, for large $L$,

\[ c(\theta) = h^{-1} \{ v_3 \cos\theta + (2 v_2 - v_3) \sin\theta \}. \tag{3.1} \]

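The ideal situation would be for $c(\theta)$ to be independent of $\theta$, but (3.1) makes it clear
that this is impossible. Define $\alpha = v_2/v_3$. A natural index of how far $c(\theta)$ falls short
of ideal is given by the ratio

\[ I(\alpha) = \max_{0 \le \theta \le \pi/4} c(\theta) \Big/ \min_{0 \le \theta \le \pi/4} c(\theta). \]

This ratio depends only on $\alpha$. If $\alpha \ge 1$, $c(\theta)$ is monotonically increasing in $[0, \pi/4]$,
and so $I(\alpha) = c(\pi/4)/c(0) = \alpha\sqrt{2}$. If $\alpha \le \tfrac{1}{2}$, $c(\theta)$ is monotonically decreasing in
$[0, \pi/4]$, and $I(\alpha) = c(0)/c(\pi/4) = 1/(\alpha\sqrt{2})$. To deal with $\tfrac{1}{2} < \alpha < 1$, define
$\theta_0 = \tan^{-1}(2\alpha - 1)$ and rewrite

\[ c(\theta) = h^{-1} v_3 \sec\theta_0\, (\cos\theta_0 \cos\theta + \sin\theta_0 \sin\theta) \]

\[ \phantom{c(\theta)} = h^{-1} v_3 \sec\theta_0 \cos(\theta - \theta_0). \tag{3.2} \]

Since for $\tfrac{1}{2} < \alpha < 1$ we have $0 < \theta_0 < \pi/4$, it follows that, for $\alpha$ in this range, $c(\theta)$
has a maximum at $\theta_0$ and that $I(\alpha) = \max(\sec\theta_0, \sec(\pi/4 - \theta_0))$. Hence $I(\alpha)$ is
minimized by setting $\theta_0 = \pi/8$.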


The minimum value is $\sec(\pi/8) = (4 - 2\sqrt{2})^{1/2} = 1.082$. Thus it follows that the
minimax score $I(\alpha)$ is optimized by setting $2\alpha - 1 = \tan(\pi/8)$, which implies that
$\alpha = \tfrac{1}{2}(1 + \tan(\pi/8)) = 1/\sqrt{2}$. If this value of $\alpha$ is used, then lines parallel to the lattice
directions or those at 45° to these directions will cost the same amount per unit length,
while the most expensive lines will be those at 22½° to the axis directions, which will
cost about 8% more. It is interesting to note that the Geman and Geman value
$\alpha = 2$ yields $I(\alpha) = 2\sqrt{2} = 2.83$, a much larger value.

It can also be shown, by somewhat tedious algebra, that $\alpha = 1/\sqrt{2}$ also minimizes
other criteria of variability of $c(\theta)$, for example the coefficient of variation of $c(\theta)$
with $\theta$ uniformly distributed over $[0, \pi/4]$.
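
The behaviour of the ratio $I(\alpha)$ can be checked numerically. The sketch below evaluates (3.1) on a grid of orientations and forms the ratio of the largest to the smallest cost; the function names, the grid size and the normalisation $v_3 = h = 1$ are assumptions made for illustration (the ratio does not depend on $v_3$ or $h$).

```python
import math


def cost_per_unit_length(theta, alpha, v3=1.0, h=1.0):
    """c(theta) from (3.1), written in terms of alpha = v2 / v3."""
    return (v3 / h) * (math.cos(theta) + (2.0 * alpha - 1.0) * math.sin(theta))


def variability_ratio(alpha, n_grid=2001):
    """Grid approximation to I(alpha) = max c / min c over 0 <= theta <= pi/4."""
    thetas = [i * (math.pi / 4) / (n_grid - 1) for i in range(n_grid)]
    costs = [cost_per_unit_length(t, alpha) for t in thetas]
    return max(costs) / min(costs)
```

Evaluating `variability_ratio` at $\alpha = 1/\sqrt{2}$ reproduces $\sec(\pi/8)$, at $\alpha = 2$ it gives $2\sqrt{2}$, and nearby values such as $\alpha = 0.70$ or $\alpha = 0.72$ give a larger ratio than $\alpha = 1/\sqrt{2}$.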

The arguments of this section make it possible to settle on a charge for
configurations of types 2 and 3. Suppose it is intended to penalize boundaries in the
underlying picture by an amount $\beta$ per unit length. In an ideal world we would like to
choose $v_2$ and $v_3$ in (3.1) to ensure that $c(\theta) = \beta$ for all $\theta$. As we have seen, this
cannot be attained exactly for all $\theta$, but setting $v_2/v_3 = 2^{-1/2}$ will minimize the
variability of $c(\theta)$ as $\theta$ varies. Having settled the ratio $v_2/v_3$, it is natural to choose $v_3$
to ensure that $(2\pi)^{-1} \int_0^{2\pi} c(\theta)\, d\theta = \beta$. By simple algebra, from (3.2),

\[ (2\pi)^{-1} \int_0^{2\pi} c(\theta)\, d\theta = 4\pi^{-1} \int_0^{\pi/4} h^{-1} v_3 \sec(\pi/8) \cos(\theta - \pi/8)\, d\theta = 8\pi^{-1} h^{-1} v_3 \tan(\pi/8) = v_3 h^{-1} k^{-1}, \]

where the constant $k = (\pi/8)/\tan(\pi/8) = 0.948$.

It follows that setting $v_3 = k\beta h$ and $v_2 = 2^{-1/2} k \beta h$ will ensure that, while $c(\theta)/\beta$
is only exactly 1 for certain values of $\theta$, it will be the case that $c(\theta)/\beta$ lies between
0.948 and 1.027 for all $\theta$ and furthermore that the average value of $c(\theta)$ over
(uniformly distributed) $\theta$ is precisely $\beta$.
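
The calibration of $v_2$ and $v_3$ can be illustrated numerically. The sketch below is a minimal example under our own naming assumptions (`K`, `edge_costs` and `relative_cost` are hypothetical names); it computes the constant $k$, the two costs for a given length penalty and lattice gauge, and the ratio of $c(\theta)$ from (3.1) to the intended penalty per unit length.

```python
import math

# The constant k = (pi/8) / tan(pi/8), approximately 0.948.
K = (math.pi / 8) / math.tan(math.pi / 8)


def edge_costs(beta, h):
    """Return (v2, v3) for a length penalty beta on a square lattice of gauge h."""
    v3 = K * beta * h
    v2 = v3 / math.sqrt(2.0)
    return v2, v3


def relative_cost(theta, beta=1.0, h=1.0):
    """c(theta) / beta from (3.1), valid for 0 <= theta <= pi/4."""
    v2, v3 = edge_costs(beta, h)
    return (v3 * math.cos(theta) + (2.0 * v2 - v3) * math.sin(theta)) / (h * beta)
```

The relative cost equals $k$ at $\theta = 0$ and $\theta = \pi/4$, peaks at $\theta = \pi/8$, and averages to 1 over orientations, whatever values of the penalty and the gauge are supplied.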

The above results are also relevant to the problem of estimating the underlying
edge length from a discretized image as posed in [1]. Suppose a line of fixed length
placed at orientation $\theta$ uniformly distributed over $[0, 2\pi]$ has $N_2$ turns and $N_3$
continuations in its discretized form. Then $2^{-1/2} k h N_2 + k h N_3$ is an unbiased estimate of
the line's original length; it is the minimum variance unbiased estimator among
estimators of the form $aN_2 + bN_3$, as a consequence of the fact that $\alpha = 1/\sqrt{2}$
minimizes the coefficient of variation of $c(\theta)$.
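
A short sketch of this length estimator follows. The function name `estimate_length` is hypothetical, and the turn and continuation counts are assumed to have been obtained already from the discretized edge pattern; how they are extracted from an image is not addressed here.

```python
import math

# The constant k = (pi/8) / tan(pi/8), approximately 0.948.
K = (math.pi / 8) / math.tan(math.pi / 8)


def estimate_length(n_turns, n_continuations, h):
    """Length estimate k*h*(N2 / sqrt(2) + N3) from counts of turns and continuations."""
    return K * h * (n_turns / math.sqrt(2.0) + n_continuations)
```

For a line along a lattice direction (continuations only) and for an idealised 45° staircase (turns only) the estimate is $k$ times the true length in both cases, in line with the earlier observation that these two orientations cost the same per unit length; the estimator is unbiased only on averaging over orientation.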

Turning  now  to  the  question  of  how  much  to  charge  for  branches  and  crossings, 
we  shall  explain  in  the  next  section  how  a  simple  argument  concerned  with  counting 
the  number  of  regions  in  the  pattern  leads  to  a  paradigm  for  dealing  with  these 
configurations. 

B.  Counting  Regions 

Suppose that, in the original pattern, the plane is divided up into a number of
simply connected regions, and that the edge process is an approximation to the
boundaries between regions in this configuration. Assume that the pixel size is
sufficiently small relative to the scale of regions in the pattern that each region is
represented by a single connected set of pixels in the discretized image.

Apart  from  the  total  edge  length,  a  natural  measure  of  the  complexity  of  the 
pattern  of  regions  is  given  by  the  number  of  regions.  If  the  region  boundaries  form  a 
connected  set,  or  equivalently  if  the  regions  are  simply  connected,  the  number  of 
regions  can  be  counted  simply  by  counting  the  number  of  "branches"  and  "crossings" 
in  the  edge  pattern.  To  do  this,  the  Euler-Poincare  formula  [3,  p.241]  is  used. 

Suppose the original process is observed on a window W in the plane and at least
one boundary intersects the window edge. Define a vertex to be a point where three
or more regions meet, or where the boundary between two regions meets the edge of
the window. Define a boundary section to be the piece of boundary or of window
edge between two vertices. Let $n_v$ be the number of vertices in the pattern, $n_e$ the
number of boundary sections and $n_r$ the number of regions. The Euler-Poincaré
formula gives the equation

\[ n_v - n_e + n_r = 1, \]

and hence

\[ n_r = 1 + n_e - n_v. \]

Now both $n_e$ and $n_v$ can be found by counting the number of branches and crossings
in the pattern, provided that points where an edge meets the edge of the window count
as branches. Let $n_b$ be the number of branches and $n_c$ the number of crossings. It is
immediate that

\[ n_v = n_b + n_c. \tag{3.3} \]

In order to count the number of boundary sections, notice that three boundary
sections meet at each branch and four at each crossing. Thus the number of ends of
boundary sections is $3n_b + 4n_c$, and since each boundary section has two ends, we
have

\[ n_e = \tfrac{3}{2} n_b + 2 n_c. \tag{3.4} \]

Substituting (3.3) and (3.4) into the expression for $n_r$ yields

\[ n_r = 1 + \tfrac{1}{2} n_b + n_c. \tag{3.5} \]

Formula (3.5) gives a natural price to be charged for branches and crossings. If it is
desired to penalize an amount $\rho$ for each region in the pattern, then one should charge
$\tfrac{1}{2}\rho$ for each branch point and $\rho$ for each crossing. If the edge configuration gives rise
to regions that are not simply connected, the right hand side of (3.5) must be increased
by 1 for each connected set of edges which does not intersect the window edge. The
charge $(\tfrac{1}{2} n_b + n_c)\rho$ can be considered in its own right as a penalty for the complexity
of the edge pattern which is calculable from local properties. The extra cost of $\rho$ for
each isolated connected set of edges cannot be calculated from local properties and,
thus, cannot be included in a restoration algorithm that operates entirely by local
updating; such an algorithm might, however, be extended to investigate the complete
removal of a small connected set of edges in the later stages of reconstruction.
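
The region count (3.5) and the associated local charge can be sketched in a few lines
of Python. This is only an illustration: the function names and the argument
`n_isolated` (the number of connected edge sets that do not touch the window edge)
are labels introduced here, not notation from the report.

```python
def count_regions(n_branches, n_crossings, n_isolated=0):
    """Number of regions implied by formula (3.5).

    Points where an edge meets the window edge count as branches.
    n_isolated (a hypothetical argument name) is the number of connected
    edge sets that do not touch the window edge; each adds one region.
    """
    if n_branches % 2:
        # Every boundary section has two ends, so 3*n_b + 4*n_c must be even.
        raise ValueError("a consistent pattern has an even number of branches")
    return 1 + n_branches // 2 + n_crossings + n_isolated


def region_charge(n_branches, n_crossings, rho):
    """Locally computable charge (n_b / 2 + n_c) * rho for an edge pattern."""
    return (0.5 * n_branches + n_crossings) * rho
```

For example, a window cut by a single boundary has two branches (its two meetings
with the window edge) and no crossings, giving two regions; a "+" shaped pattern has
four branches and one crossing, giving four regions.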

This  scheme  of  charging  for  branches  and  crossings  does  not  include  a  cost  for 
the  boundary  length  involved.  We  shall  return  to  this  point  in  Section  5  after  the 
necessary  tools  have  been  developed. 

C.  Endings 

A  pattern  made  up  of  disjoint  regions  cannot,  of  course,  have  a  configuration  of 
edges  containing  any  endings  at  all.  Therefore  the  philosophy  that  we  are  adopting 
would  naturally  lead  to  an  infinite  charge  for  configurations  of  type  1  in  Figure  3.1. 
This might still not be completely acceptable: although the configuration in Figure 3.3a
is prohibited, that in Figure 3.3b is still allowed, but note that points P and Q, which
lie close together in the same region, are separated by an edge. It must also be
remembered that to set any penalty value to infinity may lead to algorithmic
difficulties in using the model in practice. Also, a prior model for the edge process
under which some configurations have probability zero violates the condition of
positive probability for all configurations under which the theory and practice of
Markov random fields are developed; see, for example, Section 4 of [2]. In any case,
it seems excessively dogmatic to exclude certain configurations completely, since there
may be good physical reasons for a boundary to peter out in the middle of a region.
Therefore an approach that is likely to be more satisfactory is to ascribe a cost $\lambda$ to
each "loose end" in the boundary pattern, where $\lambda$ is set to a relatively large value. In
fact, there is no advantage in setting $\lambda$ much greater than $\tfrac{1}{2}\rho$, since a clever
reconstruction algorithm can simply build a small loop of edges onto a loose end at a
cost of $\tfrac{1}{2}\rho$ for the branch, plus the cost of the edge length involved.


Figure 3.3: Possible edge configurations.

D.  Summary and Example

We now summarize the appropriate relative costs of different configurations. Let
$\beta$ be the desired cost per unit length of edge, $\rho$ the cost per region of the pattern and $\lambda$
the cost per loose end. Then the programme we have set out ascribes to the possible
configurations the costs given in Table 3.1. Costs for edge length and region counting
appear separately; the costs for edge length associated with configurations 1, 4 and 5
will be derived in Section 5. As explained in Section 3C the value $\lambda = \tfrac{1}{2}\rho$ is a
reasonable choice, but there is no obvious relation between $\rho$ and $\beta$.
The interpretation of $\beta$ as a cost per unit length of boundary makes it possible to
adjust the scores in a reasonable way if the pixel grid is refined, since the cost of edge
length in configurations 1 to 5 is adjusted automatically.


Configuration         Cost

0  (none)             $0$
1  (ending)           $0.412\,h\beta + \tfrac{1}{2}\rho$
2  (turn)             $0.670\,h\beta$
3  (continuation)     $0.948\,h\beta$
4  (branch)           $1.4\,h\beta + \tfrac{1}{2}\rho$
5  (crossing)         $1.94\,h\beta + \rho$

Table 3.1. Proposed costs for the configurations of Fig. 3.1.

In order to provide an illustration of the results derived in this section, some costs
for the pixellated edge patterns shown in Figure 3.4 were calculated. The pixel size
for Figures 3.4c and 3.4d is half that used in Figures 3.4a and 3.4b, and the unit of
length is taken such that $h = 1$ in Figures 3.4a and 3.4b. The costs are presented in
Table 3.2. Our costs are given in terms of $\beta$ and $\rho$, and are also evaluated for the
case $\beta = 1$, $\rho = 50$. It can be seen that rotating the pattern affects the Geman and
Geman costs quite substantially but has very little effect on the costs calculated using
our methods. It can also be seen that our costs maintain consistency across the
different pixel sizes. The slightly larger costs obtained for the smaller pixels are
presumably due to a "fractal" effect in the discretisation of the coastline.
Figure   h     Number of cliques           Geman   Proposed cost              Proposed cost
               type 2   type 3   type 4    cost                               with $\beta = 1$, $\rho = 50$

3.4a     1     142      153      8         453     $251.4\beta + 4\rho$       451.4
3.4b     1     237      95       8         585     $260.0\beta + 4\rho$       460.0
3.4c     0.5   333      329      8         1011    $273.1\beta + 4\rho$       473.1
3.4d     0.5   493      221      8         1223    $275.5\beta + 4\rho$       475.5

Table 3.2. Costs of edge patterns shown in Fig. 3.4
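
The "Proposed cost" column of Table 3.2 can be recomputed from the clique counts and
the coefficients of Table 3.1. The sketch below is illustrative only: the dictionary
and function names are introduced here, and the pixel size h = 0.5 for Figures 3.4c
and 3.4d is taken from the statement in the text that these pixels are half the size
of those in Figures 3.4a and 3.4b.

```python
# Edge-length coefficients (per unit of h * beta) and region charges (per unit
# of rho) from Table 3.1, for the clique types that occur in Figure 3.4.
LENGTH_COEF = {2: 0.670, 3: 0.948, 4: 1.4}   # turn, continuation, branch
REGION_COEF = {2: 0.0, 3: 0.0, 4: 0.5}


def proposed_cost(counts, h, beta=1.0, rho=50.0):
    """Return (coefficient of beta, coefficient of rho, total cost)."""
    length = h * sum(LENGTH_COEF[t] * n for t, n in counts.items())
    regions = sum(REGION_COEF[t] * n for t, n in counts.items())
    return length, regions, beta * length + rho * regions


# Clique counts (type 2, type 3, type 4) and pixel size for each pattern.
patterns = {
    "3.4a": ({2: 142, 3: 153, 4: 8}, 1.0),
    "3.4b": ({2: 237, 3: 95, 4: 8}, 1.0),
    "3.4c": ({2: 333, 3: 329, 4: 8}, 0.5),
    "3.4d": ({2: 493, 3: 221, 4: 8}, 0.5),
}

for name, (counts, h) in patterns.items():
    length, regions, total = proposed_cost(counts, h)
    print(f"{name}: {length:.1f} beta + {regions:.0f} rho = {total:.1f}")
```

Because the edge-length term is proportional to h, halving the pixel size roughly
doubles the clique counts but leaves the total cost almost unchanged, which is the
consistency across pixel sizes noted in the text.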



Fig.  3.4.  Four  discretisations  of  the  same  edge  pattern. 
IV.  Irregular and Uneven Pixel Arrays

In  this  section  we  turn  to  the  case  of  arrays  of  pixels  that  are  no  longer  based  on 
a  regular  square  lattice.  One  such  case  arises  if  a  pixel  pattern  based  on  polar 
coordinates  is  used,  as  shown  in  Figure  4.1.  Such  circular  pixel  patterns  arise  very 
naturally  in  the  restoration  of  images  observed  by  positron  emission  tomography;  see 
Silverman et al. [6] for an application of the circular pixel patterns and Vardi, Shepp
and  Kaufman  [7]  for  a  general  discussion  of  the  positron  emission  tomography 
problem. 


Fig.  4.1.  A  circular  pixel  array  useful  for  positron  emission  tomography  images. 

In  general,  the  pixels  might  be  more  irregular  still,  and  might  even  themselves  be 
generated  by  a  random  process.  This  is  unlikely  to  be  the  case  where  the  experimenter 
has  control  over  the  pixel  pattern.  However,  irregular  pixels  may  well  occur,  for 
example, in geographical applications, where the observed "image" is made up of
measurements  averaged  or  cumulated  over  small  irregularly-shaped  regions,  and  it  is 
not  felt  desirable  to  superimpose  a  regular  grid  on  the  existing  irregular  pixel  pattern. 

A.  Cliques for Irregular Edge Processes

We shall assume, for the moment, that the pixel pattern forms a tessellation of the
plane or a portion of the plane, and that except at the boundary of the pattern, exactly
three pixels meet at each vertex of the tessellation. This assumption is, of course,
violated for the circular lattice of Figure 4.1, since some of its vertices are of degree
three and some of degree four. It does, however, hold (with probability 1) for many
randomly generated pixel models, for example if the pixels are the Voronoi polygons
of a homogeneous planar Poisson process.

As  in  the  case  of  the  square  lattice,  the  line-sites  in  the  edge  process  will  be  the 
boundary  sections  of  the  pixel  array,  and  we  shall  suppose  that  each  clique  of  the  edge 
process  consists  of  the  three  line-sites  meeting  at  a  particular  vertex.  There  are  now 
four  possible  types  of  configurations  for  a  particular  clique  in  the  edge  process, 
depending  on  how  many  edges  are  present  in  the  clique.  We  shall  say  that  the 
configuration is of type $k$ for $k = 0, 1, 2, 3$ if $k$ of the three line-sites in the clique are
actually  occupied  by  edges.  These  configurations  are  illustrated  in  Figure  4.2. 


Type:   0     1     2     3

Figure  4.2:  Configurations  for  a  vertex  of  degree  3. 


Although,  in  contrast  with  the  case  of  square  pixels,  there  are  fewer  types  of 
configuration  to  consider,  the  irregularity  of  the  pixels  means  that  it  is  no  longer 
necessarily  the  case  that  all  configurations  of  a  particular  type  should  attract  the  same 
penalty. 

The first stage in the assignment of costs to various configurations is to use the
same region counting arguments as in the square lattice case to assign charges $0$, $\tfrac{1}{2}\rho$
and $\tfrac{1}{2}\rho$ to configurations of types 0, 1 and 3 respectively. It remains to ascribe costs
for the edge length associated with each configuration. In order to do this, construct a
dual edge pattern by placing a point in each cell of the original pixel array, and joining
points if their corresponding pixels have some boundary in common. The vertices of
the dual array can, in principle, be placed anywhere in their corresponding pixels, but
in practice they will have a natural position. For example, if the pixels are constructed
as the Voronoi polygons of a point process then the points of the process will
themselves be the vertices of the dual array.

Our assumption that exactly three pixels meet at each vertex of the original
tessellation implies that the dual edge pattern will be a triangulation of the plane. In
the case of the square pixel array the cost of "continuation" configurations was
determined by considering a pattern with a single long straight edge, suitably
discretized to fit the pixel pattern. In the more general case, it is no longer quite so
clear how this discretization should be performed. One natural way to proceed is to
prescribe that an edge segment will be present in the edge process if and only if the
corresponding dual edge is intersected by the straight line boundary. We assume, if
necessary giving the line an infinitesimal displacement perpendicular to its direction,
that no vertices of the dual triangulation lie exactly on the line.

Any  edge  process  in  the  original  tessellation  corresponds  to  an  edge  process  in 
the  dual  triangulation  in  the  natural  way,  a  dual  edge  being  present  in  the  process 
whenever  the  corresponding  original  line  site  is  occupied.  Each  clique  of  the  original 
edge  process  will  correspond  to  a  triangle  in  the  dual  triangulation;  the  original  edge 
configuration will be a "continuation" if and only if exactly two of the edges are
present  in  the  corresponding  dual  clique.  Every  triangle  intersected  by  the  straight  line 
boundary  will  give  rise  to  a  continuation  clique,  since  exactly  two  of  its  edges  are 
necessarily  intersected  by  the  line.  We  shall  now  describe  two  possible  approaches  to 
charging  for  continuation  configurations.  The  first  of  these  appears  more  natural  at 
first  sight,  but  it  leads  to  much  more  complicated  formulas;  we  shall  also  show  that  the 
second  has  the  additional  advantage  of  being  a  genuine  generalization  of  our  square 
pixel  formulas. 

B.  Two  Possible  Length  Penalties 

Let  us  concentrate  on  a  single  triangle  i  in  the  dual  triangulation.  The  notation 
for  this  triangle  will  be  as  in  Figure  4.3.  The  capital  letters  A,B,C  will  be  used  for 
both  the  vertices  themselves  and  for  the  angles  at  these  vertices. 


Figure 4.3: A triangle in the dual triangulation.


The lower case letters refer to the sides themselves and to the lengths of these sides.
We shall derive possible ways of charging for the "continuation" configuration $bc$
given by the presence of the edges dual to $b$ and $c$ and the absence of the edge dual to
$a$. These costs will be based on the general idea that boundaries should cost an
amount $\beta$ per unit length; for notational simplicity we shall assume henceforth that
$\beta = 1$, and note that the costs obtained should be multiplied by $\beta$ in the general case.

Let $l$ be a random line of fixed length $L$ randomly situated in the plane, in a sense
that will be made precise below. Let $L_i$ be the length of the intersection of $l$ with the
triangle $i$. Then our first possible cost for the continuation configuration $bc$ is

$$V_1(i,b,c) = E(L_i \mid l \text{ intersects triangle } i \text{ through } b \text{ and } c).$$

The motivation for this definition is straightforward. Let $I(i,e_1,e_2)$ be the
indicator variable taking the value 1 if $l$ intersects sides $e_1$ and $e_2$ of triangle $i$ and 0
otherwise. Let $L(i,e_1,e_2)$ be the length of the intersection of $l$ with triangle $i$ if
$I(i,e_1,e_2) = 1$ and 0 otherwise. Now, ignoring end effects, the total line length
$L = \sum L(i,e_1,e_2)$, where the summation is taken over all triangles $i$ and pairs of
edges $(e_1,e_2)$. The total charge for the line $l$ is then

$$S_1 = \sum I(i,e_1,e_2)\, V_1(i,e_1,e_2).$$

Thus, up to the approximation involved in ignoring end effects,

$$
\begin{aligned}
E(S_1) &= E\Big[\sum I(i,e_1,e_2)\, E\{L(i,e_1,e_2) \mid I(i,e_1,e_2) = 1\}\Big] \\
       &= \sum E\{L(i,e_1,e_2)\} \\
       &= E\Big\{\sum L(i,e_1,e_2)\Big\} = E(L) = L
\end{aligned}
\tag{4.1}
$$

and $S_1$ is an unbiased estimate of the line length, $L$. It is clear that the above
argument will also hold if $V_1(i,e_1,e_2)$ is replaced by $E\{g(i,e_1,e_2) \mid I(i,e_1,e_2) = 1\}$
for any function $g(i,e_1,e_2)$ for which, apart from end effects, $\sum g(i,e_1,e_2) = L$. Our
second proposed cost is also of this general form.


Figure  4.4:  A  side  with  negative  projected  length. 


Let $p_a$ be the projected length of the side $a$ on the line $l$; this length is to be
counted as negative if $l$ makes an angle of more than $\tfrac{1}{2}\pi$ with $a$, in the sense shown in
Figure 4.4. The second proposed cost is

$$V_2(i,b,c) = E(\tfrac{1}{2} p_a \mid l \text{ intersects triangle } i \text{ through } b \text{ and } c).$$

To justify this definition, consider the union of all the triangles intersected by the line
$l$. This forms an irregular strip in the plane. The two edges of this strip, one on each
side of $l$, are made up by those edges not intersected by $l$. The total projection length
of these edges on $l$, neglecting end effects, is equal to twice the length of $l$, and the
sum of all the quantities like $\tfrac{1}{2} p_a$ is equal to $L$. Hence $S_2 = \sum I(i,e_1,e_2)\, V_2(i,e_1,e_2)$ is
an unbiased estimate of $L$.

The sense in which $l$ is a random line is as follows. Choose an origin O in the
plane and let $R$ be large enough to ensure that the triangle ABC is entirely enclosed
within the circle with centre O and radius $R$. Now construct $l$ such that the perpendicular
from O to $l$ has orientation uniformly distributed on $(0, 2\pi)$ and length uniformly
distributed on $(0, R)$. Restricted to lines that intersect triangle $i$ through sides $b$ and $c$,
this is the conditional distribution of $l$ that arises if the pixel grid and associated
triangulation are placed down in a random position and at a random orientation relative
to the line $l$. By standard stochastic geometry, the quantities $V_1$ and $V_2$ will be
independent of the choice of $R$.


Figure 4.5: A random line intersecting the dual triangle.


We now calculate $V_1$ and $V_2$. Let $\Theta$ be the angle between $l$ and BC, measured as
shown in Figure 4.5. The first step is to find the density $f(\theta)$ of $\Theta$ conditional on $l$
intersecting $b$ and $c$. Note first that, of necessity, $-C < \Theta < B$; consider first the
range $0 < \theta < B$.

For such $\theta$, $l$ will intersect $c$ and $b$ if and only if it intersects $c$. The set of lines at
orientation $\theta$ that intersect $c$ make up a strip of width $c \sin(B - \theta)$ and so we have

$$f(\theta) \propto c \sin(B - \theta) \quad \text{for } 0 < \theta < B.$$

For $-C < \theta < 0$, a similar argument yields

$$f(\theta) \propto b \sin(C + \theta) \quad \text{for } 0 < -\theta < C.$$

To calculate the constant of proportionality, we note that

$$
\begin{aligned}
\int_0^B c \sin(B - \theta)\, d\theta + \int_{-C}^0 b \sin(C + \theta)\, d\theta
  &= c \int_0^B \sin\varphi\, d\varphi + b \int_0^C \sin\varphi\, d\varphi \\
  &= b + c - c\cos B - b\cos C = b + c - a,
\end{aligned}
$$

and hence we have

$$
f(\theta) =
\begin{cases}
c \sin(B - \theta)/(b + c - a) & 0 < \theta < B \\
b \sin(C + \theta)/(b + c - a) & -C < \theta < 0 \\
0 & \text{otherwise.}
\end{cases}
$$


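To calculate $V_1$, consider first $\theta > 0$. Given that $\Theta = \theta$ and that $l$ intersects $c$ and $b$,
the expected value of $L_i$ is half its value when $\Theta = \theta$ and $l$ passes through B. This
length is, by the sine formula, equal to $c \sin A / \sin(C + \theta)$. Hence we have

$$
V_1 = \int_0^B \frac{\tfrac{1}{2} c \sin A}{\sin(C + \theta)}\, f(\theta)\, d\theta
    + \int_{-C}^0 \frac{\tfrac{1}{2} b \sin A}{\sin(B - \theta)}\, f(\theta)\, d\theta.
\tag{4.2}
$$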


Figure 4.6: The random line passes through B.


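To calculate these integrals, first substitute $\varphi = C + \theta$ and use the fact that
$A + B + C = \pi$ to give

$$
\begin{aligned}
\int_0^B \frac{\sin(B - \theta)}{\sin(C + \theta)}\, d\theta
  &= \int_C^{B+C} \frac{\sin(B + C - \varphi)}{\sin\varphi}\, d\varphi \\
  &= \sin A \int_C^{B+C} \cot\varphi\, d\varphi + B\cos A
   = \sin A\,(\log\sin A - \log\sin C) + B\cos A.
\end{aligned}
$$

Substituting this and the corresponding formula for the second integral into (4.2) gives,
after some trigonometry,

$$
(b + c - a)\, V_1 = \tfrac{1}{4}(c^2 B + b^2 C)\sin 2A
  + \tfrac{1}{2}\sin^2 A\, \{(b^2 + c^2)\log\sin A - c^2 \log\sin C - b^2 \log\sin B\}.
\tag{4.3}
$$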

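This formula is complicated and inelegant, and it turns out that $V_2$ is much more
simply expressed. Given that $\Theta = \theta$, we have $p_a = a \cos\theta$, and hence

$$
\begin{aligned}
V_2 &= \tfrac{1}{2} a \int_{-C}^{B} \cos\theta\, f(\theta)\, d\theta \\
    &= (b + c - a)^{-1} \Big[ \int_0^B \tfrac{1}{2} a c \cos\theta \sin(B - \theta)\, d\theta
       + \int_0^C \tfrac{1}{2} a b \cos\theta' \sin(C - \theta')\, d\theta' \Big].
\end{aligned}
\tag{4.4}
$$

The first integral in (4.4) is equal to

$$\int_0^B \tfrac{1}{4} a c\, \{\sin B + \sin(B - 2\theta)\}\, d\theta = \tfrac{1}{4} a c\, B \sin B = \tfrac{1}{2}\Delta B,$$

where $\Delta$ is the area of the triangle $i$, and hence

$$V_2 = \tfrac{1}{2}(B + C)\Delta/(b + c - a). \tag{4.5}$$

Thus it is clear that the formula for $V_2$ is very simple and more appealing than that for
$V_1$.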
A second reason for preferring $V_2$ to $V_1$ will be elaborated in Section 5 below. It
is shown there that, for square pixel arrays, the projection approach produces the
minimum variance unbiased estimate of line length.
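
Formula (4.5) can be checked by simulating the random line directly. The sketch below
is illustrative rather than part of the original derivation: the function names, the
sample size and the choice of origin at (0, 0) are assumptions made here. It draws
lines whose perpendicular from the origin has uniform orientation and length, keeps
those that separate vertex A from B and C (so that they cross sides b and c but not
a), and averages half the signed projection of side a on the line.

```python
import math
import random


def v2_formula(A, B, C):
    """Closed form (4.5), V2 = (B + C) * area / (2 * (b + c - a)), from vertex coordinates."""
    a, b, c = math.dist(B, C), math.dist(C, A), math.dist(A, B)
    area = 0.5 * abs((B[0] - A[0]) * (C[1] - A[1]) - (B[1] - A[1]) * (C[0] - A[0]))
    ang_b = math.acos((a * a + c * c - b * b) / (2 * a * c))
    ang_c = math.acos((a * a + b * b - c * c) / (2 * a * b))
    return 0.5 * (ang_b + ang_c) * area / (b + c - a)


def v2_monte_carlo(A, B, C, n_lines=200_000, seed=0):
    """Average of p_a / 2 over random lines crossing sides b and c only."""
    rng = random.Random(seed)
    # Radius of a circle about the origin that encloses the triangle.
    radius = 1.0 + max(math.hypot(*P) for P in (A, B, C))
    total, hits = 0.0, 0
    for _ in range(n_lines):
        phi = rng.uniform(0.0, 2.0 * math.pi)
        p = rng.uniform(0.0, radius)
        nx, ny = math.cos(phi), math.sin(phi)
        # Signed distances of the vertices from the line {x : x.n = p}.
        d_a = A[0] * nx + A[1] * ny - p
        d_b = B[0] * nx + B[1] * ny - p
        d_c = C[0] * nx + C[1] * ny - p
        if d_a * d_b < 0 and d_a * d_c < 0:
            # Crossing points X on side c (AB) and Y on side b (AC).
            t_x = d_a / (d_a - d_b)
            t_y = d_a / (d_a - d_c)
            X = (A[0] + t_x * (B[0] - A[0]), A[1] + t_x * (B[1] - A[1]))
            Y = (A[0] + t_y * (C[0] - A[0]), A[1] + t_y * (C[1] - A[1]))
            length = math.dist(X, Y)
            ux, uy = (Y[0] - X[0]) / length, (Y[1] - X[1]) / length
            # Signed projection of side a (from B to C) on the line.
            p_a = (C[0] - B[0]) * ux + (C[1] - B[1]) * uy
            total += 0.5 * p_a
            hits += 1
    return total / hits
```

Orienting the line from its crossing of c to its crossing of b makes the projection
equal to a cos(theta), so that an angle of more than a right angle gives a negative
contribution, as required by the definition of $p_a$.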

C.  Line  Length  Associated  with  Endings  and  Branches 

We now turn to the problem of ascribing a cost for the edge length associated
with configurations 1 and 3 of Figure 4.2. From now on we shall restrict our attention
to the "projection" cost $V_2$. As previously explained, the union of all triangles
intersected by the line $l$ forms an irregular strip in the plane and the sum of the
projection lengths of the edges of this strip in the direction of $l$ is approximately twice
the length of $l$. This approximation becomes exact if the strip is terminated with edges
AF and BF, as shown in Figure 4.7, and the corresponding edges at its other end.


Figure 4.7: An end of line $l$ in the dual triangle ABC.
The vector $\mathbf{u}$ is a unit vector in the direction of $l$.


Let $p_F$ be the sum of the projections of AF and BF in the direction of $l$ or, more
formally,

$$p_F = (\vec{AF} + \vec{BF})\cdot\mathbf{u},$$

where $\mathbf{u}$ is a unit vector in the direction of $l$ (see Figure 4.7) and $\vec{AF}$ and $\vec{BF}$ are
vectors. This definition automatically provides a correct treatment of any negative
projections. We define

$$V_2(i,c) = E(\tfrac{1}{2} p_F \mid l \text{ intersects } c \text{ and terminates in triangle } i),$$
where the distribution of $l$ and its end point are as described below. Repeating the
argument leading up to (4.1) we see that the sum of costs $V_2$ is now an exactly
unbiased estimate of the length of line $l$ when the line is placed at random on an
infinite pixel grid. (An infinite grid is needed to avoid problems at the window edge.)

In calculating $V_2$ for this case the distribution of line $l$ is as described previously
but with an extra multiplicative factor proportional to the length of the intersection of $l$
and the triangle ABC; only lines intersecting AB are considered, and the right hand end
of the line is distributed uniformly along the length of the line $l$ inside the triangle.
Again, this corresponds to the conditional distribution of the line and its end, given
that the line enters triangle ABC through edge $c$ and terminates inside the triangle,
when the pixel grid is placed in a random position and orientation. A long and tedious
calculation gives the value of $E(\tfrac{1}{2} p_F)$ for this case:

$$
\begin{aligned}
V_2 = \tfrac{1}{6} K \{ & 2a^3 + a^2 b + a b^2 + 2b^3 - 3(a^2 + b^2)c - (a - b)(b\cos A - a\cos B)c \\
  & - 3a^2 c \cos A \log\tan\tfrac{1}{2}A - 3b^2 c \cos B \log\tan\tfrac{1}{2}B
    - 3c(a^2 \cos A + b^2 \cos B)\log\tan\tfrac{1}{2}C \}
\end{aligned}
$$

where

$$K = \Big\{ a^2 \Big(B\cot A + \log\frac{a}{c}\Big) + b^2 \Big(A\cot B + \log\frac{b}{c}\Big) + abC\,\mathrm{cosec}\,C \Big\}^{-1}.$$

This formula simplifies in special cases; for example, on a regular hexagonal grid in
which all dual triangles are equilateral of side $a$, $V_2 = (3\sqrt{3}/8\pi)\, a \log 3$.
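
The special case quoted for the hexagonal grid can be checked by simulation. The
sketch below is an illustration under our reading of the sampling scheme described
above, not the authors' calculation: weighting lines by their chord length and placing
the end point uniformly along the chord is treated as equivalent to choosing the end
point F uniformly by area in the cell and the direction of $l$ uniformly on the circle,
then keeping only those cases in which the line, traced backwards from F, leaves the
cell through the chosen side. The function name and arguments are hypothetical.

```python
import math
import random


def _cross(p, q):
    """z-component of the cross product of two plane vectors."""
    return p[0] * q[1] - p[1] * q[0]


def ending_cost_mc(vertices, entry_side=0, n_samples=400_000, seed=0):
    """Monte Carlo estimate of E(p_F / 2) for a line that enters a convex cell
    through the side vertices[entry_side] -> vertices[entry_side + 1] and ends
    at a point F inside the cell."""
    rng = random.Random(seed)
    n = len(vertices)
    P, Q = vertices[entry_side], vertices[(entry_side + 1) % n]
    mid = ((P[0] + Q[0]) / 2.0, (P[1] + Q[1]) / 2.0)
    x_lo, x_hi = min(v[0] for v in vertices), max(v[0] for v in vertices)
    y_lo, y_hi = min(v[1] for v in vertices), max(v[1] for v in vertices)
    total, accepted = 0.0, 0
    for _ in range(n_samples):
        F = (rng.uniform(x_lo, x_hi), rng.uniform(y_lo, y_hi))
        # Rejection step: keep F only if it lies inside the convex cell.
        signs = []
        for k in range(n):
            V, W = vertices[k], vertices[(k + 1) % n]
            signs.append(_cross((W[0] - V[0], W[1] - V[1]), (F[0] - V[0], F[1] - V[1])))
        if not (all(s > 0 for s in signs) or all(s < 0 for s in signs)):
            continue
        phi = rng.uniform(0.0, 2.0 * math.pi)
        u = (math.cos(phi), math.sin(phi))       # direction of travel of the line
        back = (-u[0], -u[1])                    # direction back towards the entry point
        fp = (P[0] - F[0], P[1] - F[1])
        fq = (Q[0] - F[0], Q[1] - F[1])
        cone = _cross(fp, fq)
        # The line entered through PQ if 'back' lies in the cone spanned by FP and FQ.
        if _cross(fp, back) * cone > 0 and _cross(back, fq) * cone > 0:
            # (PF + QF).u / 2 equals (F - midpoint of PQ).u
            total += (F[0] - mid[0]) * u[0] + (F[1] - mid[1]) * u[1]
            accepted += 1
    return total / accepted
```

For an equilateral triangle of unit side, for instance
`ending_cost_mc([(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)])`, the estimate can be
compared with $(3\sqrt{3}/8\pi)\log 3$. The same function accepts any convex cell, so it
can also be applied to a square cell of the dual lattice.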

A similar calculation could be performed for the branch in configuration 3 of
Figure 4.2. We shall not complete such a calculation but we shall describe the general
approach. A typical configuration in the dual space is depicted in Figure 4.8 and the
appropriate definition of $p_F$ is

$$p_F = (\vec{AF} + \vec{BF})\cdot\mathbf{u}_1 + (\vec{CF} + \vec{AF})\cdot\mathbf{u}_2 + (\vec{BF} + \vec{CF})\cdot\mathbf{u}_3.$$


Figure  4.8:  Three  lines  meeting  in  the  dual  triangle  ABC 
and  their  associated  unit  vectors. 




When calculating $V_2 = E(\tfrac{1}{2} p_F)$ it should be noted that configurations such as that in
Figure 4.9 also give rise to the same configuration of pixel edges; this does not cause
a serious problem and the total edge length will be estimated correctly as long as these
cases are treated as meeting in ABC. Note that for $E(\tfrac{1}{2} p_F)$ to be properly defined it is
necessary to introduce a joint distribution for the angles between the three lines,
preferably by appealing to specialised knowledge of the image in question.


Figure  4.9:  Three  lines  meeting  outside  the  dual  triangle  ABC  but  still 
producing  three  edges  meeting  at  the  vertex  associated  with  ABC. 


V.  Regular  Arrays  Revisited 

In  the  last  section  we  defined  two  different  ways  of  obtaining  penalties  for 
continuation  configurations.  One  of  these  was  based  on  the  length  of  the  intersection 
of  a  region  in  the  dual  triangulation  with  a  random  line,  and  the  other  on  the  length  of 
projection  of  such  a  region  on  a  random  line.  It  turned  out  that  the  projection  penalty 
gave  a  much  more  elegant  result.  In  this  section,  we  shall  apply  the  intersection  and 
projection  ideas  to  the  regular  square  lattice  considered  earlier,  and  to  rectangular  and 
hexagonal  lattices. 




Figure  5.1:  A  random  line  intersecting  a  square  in  the  dual  lattice. 
A.  Square Lattices

Our aim is to obtain costs for the "turn" and "continuation" configurations as
illustrated in Figure 3.1. The dual of the square lattice is itself a square lattice, and
the part of the dual corresponding to a clique is a single square of side $h$ as in Figure
5.1. The configuration of edges in the original clique will be a straight continuation if
$l$ crosses AD and BC. We find the distribution of $\Theta$ conditional on $l$ being a random
line under this additional condition.


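Figure 5.2: Lines of inclination $\theta$ crossing AD and BC
form a strip of width $\sqrt{2}\, h \sin(\tfrac{1}{4}\pi - |\theta|)$.


For $-\tfrac{1}{4}\pi < \theta < \tfrac{1}{4}\pi$ the set of lines crossing AD and BC will be a strip of width
$h\sqrt{2}\, \sin(\tfrac{1}{4}\pi - |\theta|)$ by some easy trigonometry. Hence the density $f_1(\theta)$ of $\Theta$
conditional on $l$ crossing AD and BC will satisfy

$$f_1(\theta) = (2 - \sqrt{2})^{-1} \sin(\tfrac{1}{4}\pi - |\theta|), \qquad -\tfrac{1}{4}\pi < \theta < \tfrac{1}{4}\pi,$$

using simple calculus to find the constant of proportionality. The intersection length
is equal to $h \sec\theta$, and hence the expected intersection length is

$$
\begin{aligned}
\int_{-\pi/4}^{\pi/4} h \sec\theta\, f_1(\theta)\, d\theta
  &= 2h(2 - \sqrt{2})^{-1} \int_0^{\pi/4} \sec\theta \sin(\tfrac{1}{4}\pi - \theta)\, d\theta \\
  &= (\sqrt{2} - 1)^{-1} h \int_0^{\pi/4} (1 - \tan\theta)\, d\theta
   = (\sqrt{2} - 1)^{-1} h\, [\theta - \log\sec\theta]_0^{\pi/4} \\
  &= (\sqrt{2} - 1)^{-1} h\, (\tfrac{1}{4}\pi - \tfrac{1}{2}\log 2).
\end{aligned}
$$

Thus the "intersection" penalty for a configuration of type 3 in Figure 3.1 would be
$(\tfrac{1}{4}\pi - \tfrac{1}{2}\log 2)\, h/(\sqrt{2} - 1) = 1.06\,h$.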

To find the "projection" penalty for such a configuration, note that the appropriate
generalization of the projection argument given in Section 4 is to take as penalty
$\tfrac{1}{2}$(projection of AB and DC), because both AB and DC will be edges of the irregular
strip formed by the union of those dual squares intersected by $l$. Both AB and DC
have projection length $h \cos\theta$ on $l$, and so the "projection" penalty for a configuration
of type 3 will be

$$
\begin{aligned}
\int_{-\pi/4}^{\pi/4} h \cos\theta\, f_1(\theta)\, d\theta
  &= (1 - \tfrac{1}{2}\sqrt{2})^{-1} h \int_0^{\pi/4} \cos\theta \sin(\tfrac{1}{4}\pi - \theta)\, d\theta \\
  &= (2 - \sqrt{2})^{-1} h \int_0^{\pi/4} \{\sin\tfrac{1}{4}\pi - \sin(2\theta - \tfrac{1}{4}\pi)\}\, d\theta \\
  &= \tfrac{1}{8}\pi \tan(\tfrac{3}{8}\pi)\, h = \kappa h = 0.95\,h,
\end{aligned}
$$

where $\kappa = \tfrac{1}{8}\pi \tan\tfrac{3}{8}\pi$ as defined in Section 3.
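
Both continuation penalties for the square lattice can be confirmed by numerical
integration against the density $f_1$. The following is a minimal sketch with a simple
midpoint rule; the function names and the number of integration steps are choices
made here for illustration.

```python
import math


def f1(theta):
    """Density of the line orientation, conditional on crossing AD and BC."""
    return math.sin(math.pi / 4 - abs(theta)) / (2 - math.sqrt(2))


def integrate(g, lo, hi, n=20_000):
    """Composite midpoint rule on [lo, hi] with n equal steps."""
    step = (hi - lo) / n
    return step * sum(g(lo + (k + 0.5) * step) for k in range(n))


def continuation_penalties(h=1.0):
    """Return the (intersection, projection) penalties for a straight continuation."""
    lo, hi = -math.pi / 4, math.pi / 4
    intersection = integrate(lambda t: h / math.cos(t) * f1(t), lo, hi)
    projection = integrate(lambda t: h * math.cos(t) * f1(t), lo, hi)
    return intersection, projection
```

With h = 1 the two values returned should agree with the closed forms
$(\tfrac{1}{4}\pi - \tfrac{1}{2}\log 2)/(\sqrt{2} - 1)$ and $\tfrac{1}{8}\pi\tan\tfrac{3}{8}\pi$ to within the integration error.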

To find the penalties for "turn" configurations, the work of Section 4 can be used
almost directly, by noticing that both the "intersection" and "projection" penalties will
be the same as those obtained there, for the case of a line crossing the two short sides
of an isosceles right-angled triangle. Thus we set $a = h\sqrt{2}$, $b = c = h$,
$B = C = \tfrac{1}{4}\pi$ and $A = \tfrac{1}{2}\pi$ in the formulas (4.3) and (4.5).

We obtain as the intersection penalty for the turn configuration
$V_1 = \tfrac{1}{2} h \log 2/(2 - \sqrt{2}) = 0.59\,h$ and for the projection penalty $V_2 = 2^{-1/2}\kappa h = 0.67\,h$.
It is noteworthy that the projection approach yields penalties for the two configurations
that are identical to those obtained in Section 3. Thus, by the argument of Section 4,
up to the approximation of ignoring end effects, the projection penalty is the minimum
variance unbiased estimate of line length calculated from cliques of four line sites
only. The intersection approach yields a higher cost for straight continuation and a
lower cost for turns and, thus, has greater than necessary variability with orientation
in the cost of a long straight line boundary.
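
The two turn penalties can be reproduced by evaluating (4.3) and (4.5) for the
isosceles right-angled triangle. The sketch below takes the three side lengths as
input and recovers the angles by the cosine rule; the function names are introduced
here for illustration only.

```python
import math


def triangle_angles(a, b, c):
    """Angles A, B, C opposite the sides a, b, c (cosine rule)."""
    A = math.acos((b * b + c * c - a * a) / (2 * b * c))
    B = math.acos((a * a + c * c - b * b) / (2 * a * c))
    return A, B, math.pi - A - B


def v1_intersection(a, b, c):
    """Intersection cost from formula (4.3) for a line crossing sides b and c."""
    A, B, C = triangle_angles(a, b, c)
    rhs = 0.25 * (c * c * B + b * b * C) * math.sin(2 * A) + 0.5 * math.sin(A) ** 2 * (
        (b * b + c * c) * math.log(math.sin(A))
        - c * c * math.log(math.sin(C))
        - b * b * math.log(math.sin(B))
    )
    return rhs / (b + c - a)


def v2_projection(a, b, c):
    """Projection cost from formula (4.5) for a line crossing sides b and c."""
    A, B, C = triangle_angles(a, b, c)
    area = 0.5 * b * c * math.sin(A)
    return 0.5 * (B + C) * area / (b + c - a)
```

Calling `v1_intersection(math.sqrt(2), 1.0, 1.0)` and `v2_projection(math.sqrt(2), 1.0, 1.0)`
gives the turn penalties for h = 1; for other pixel sizes both scale linearly with h.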

Following the development of Section 4, we now use the projection cost to assign
costs for edge length in configurations 1, 4, and 5 of Figure 3.1. For an ending
(configuration 1), as shown in Figure 5.3, the required cost is

$$V_2 = E(\tfrac{1}{2}(\vec{AF} + \vec{DF})\cdot\mathbf{u} \mid l \text{ crosses AD and terminates in ABCD}).$$

The joint distribution of $l$ and F is essentially as for the case of an ending in a triangle
treated in Section 4C and routine calculation gives

$$V_2 = \{(\sqrt{2} - 1) + \log(\sqrt{2} + 1)\}\, h/\pi.$$


Figure 5.3: A line $l$ ending in the square ABCD in the dual lattice.
The unit vector $\mathbf{u}$ is in the direction of $l$.
Figure 5.4: Line $l$ crossing dual edge AD and leaving the window.
Dotted lines show edges present in the edge process.
The unit vector $\mathbf{u}$ is in the direction of $l$.


A case not yet mentioned is that of a line ending at the edge of the window.
Figure 5.4 shows a line meeting the window edge after crossing edge AD of the dual
square ABCD. To complete the irregular strip containing that part of $l$ within the
window we need to add edges AF and DF. Thus the cost of edge GF in the one edge
clique associated with G is

$$V_2 = E(\tfrac{1}{2}(\vec{AF} + \vec{DF})\cdot\mathbf{u} \mid l \text{ crosses AD and then leaves the window}).$$

Strictly speaking, this depends on the position of G relative to the corners of the
window. The calculation is simplified if we assume an infinite window edge, in which
case $V_2 = \pi h/4$.
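
The value for an infinite window edge can be recovered by a short calculation. The
sketch below rests on an assumption about the geometry of Figure 5.4, namely that the
dual edge AD is parallel to the window edge and at distance $\tfrac{1}{2}h$ from it; the symbols M
(the midpoint of AD), X (the point where $l$ crosses AD) and $\varphi$ (the angle between $l$
and the normal to the window edge) are notation introduced here. Since F lies at
distance $h/(2\cos\varphi)$ beyond X along $l$,

$$\tfrac{1}{2}(\vec{AF} + \vec{DF})\cdot\mathbf{u} = \vec{MF}\cdot\mathbf{u} = \vec{MX}\cdot\mathbf{u} + \frac{h}{2\cos\varphi}.$$

For a uniformly random line crossing AD, $\varphi$ has density $\tfrac{1}{2}\cos\varphi$ on $(-\tfrac{1}{2}\pi, \tfrac{1}{2}\pi)$ and,
given $\varphi$, X is uniformly distributed along AD, so the first term has mean zero and

$$V_2 = \int_{-\pi/2}^{\pi/2} \frac{h}{2\cos\varphi}\cdot\tfrac{1}{2}\cos\varphi\, d\varphi = \frac{\pi h}{4},$$

which agrees with the value stated for the infinite window edge.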


Figure  5.5:  Three  lines  meeting  in  or  near  dual  square  ABCD 
and  associated  unit  vectors. 


A branch (configuration 4) arises when three lines meet. In all four cases shown
in Figure 5.5 the branch is associated with dual square ABCD and the sum of the
projection lengths required to close off the irregular strips containing $l_1$, $l_2$ and $l_3$ is

$$p_F = (\vec{DF} + \vec{AF})\cdot\mathbf{u}_1 + (\vec{AF} + \vec{BF})\cdot\mathbf{u}_2 + (\vec{BF} + \vec{CF})\cdot\mathbf{u}_3.$$

In calculating $V_2 = E(\tfrac{1}{2} p_F)$, the expectation is with respect to a uniform distribution of
the point of intersection F in the plane and of the orientation of the set of lines
$l_1$, $l_2$ and $l_3$, conditional on sides DA, AB and BC but not CD being intersected. For
a given set of angles between the three lines containing no acute angle, calculation of
$V_2$ by numerical integration or Monte Carlo estimation is straightforward. Note,
however, that if the lines do meet at acute angles the associated edge process can be
more complex than a single branch, as shown in Figure 5.6.


Figure 5.6: Lines $l_1$, $l_2$ and $l_3$ meeting at a point. The dotted lines
representing elements of the associated edge process include a
crossing and three branches.


For the special cases of three angles of $2\pi/3$ and angles of $\pi/2$, $\pi/2$ and $\pi$ between the
lines $l_1$, $l_2$ and $l_3$, $V_2 = 1.32\,h$ and $1.45\,h$ respectively. The value $V_2 = 1.4\,h$
associated with configuration 4 in Table 3.1 was chosen as a compromise between
these two cases.

A  crossing  (configuration  5)  can  arise  when  four  lines  meet  but  this  will  not 
always  be  the  case.  In  Figure  5.7b  the  meeting  of  four  lines  produces  two  adjacent 
branches  rather  than  a  crossing  in  the  edge  process:  the  projection  costs  calculated  for 
a  branch  formed  by  the  meeting  of  three  lines  are  inappropriate  in  this  case  but  a 
proper  treatment  would  be  possible  if  the  clique  size  were  enlarged.  When  the 
meeting  of  four  lines  does  produce  a  crossing  in  the  edge  process  the  configuration 
must  be  of  the  type  shown  in  Figure  5.7a,  and  the  projection  cost  is 

$V_2 = E[\tfrac12\{(DF + AF)\cdot u_1 + (AF + BF)\cdot u_2 + (BF + CF)\cdot u_3 + (CF + DF)\cdot u_4\}]$

where  F  is  distributed  uniformly  over  the  interior  of  ABCD  and  the  orientation  of  the 
set  of  four  lines  is  uniform  conditional  on  one  line  intersecting  each  edge.  For  the 
case  of  four  lines  meeting  at  right  angles  and  producing  a  crossing  in  the  edge  process 
numerical integration gives $V_2 = 1.94h$ and this is the value used for configuration 5
in Table 3.1.


(a)  (b) 


Figure 5.7: Four lines meeting at a point. Dotted lines show elements of
the associated edge process. Unit vectors $u_1, \ldots, u_4$ are in the
directions of lines $l_1, \ldots, l_4$.

B.  Rectangular  Pixels 

In this section we consider rectangular pixels of length $h_1$ and breadth $h_2$.
Corresponding  to  the  six  possible  types  of  configurations  of  edges  in  Figure  3.1  there 
are  now  nine  possible  essentially  different  configurations,  since  there  are  two  types 
each  of  endings,  continuations  and  branches.  For  brevity  we  shall  concentrate  on  the 
costs  for  continuations  and  turns.  The  cost  of  a  turn  is  calculated  by  applying 
formula (4.5) to a right-angled triangle with short sides $h_1$ and $h_2$, yielding the
quantity $\tfrac12 h_1 h_2/\{h_1 + h_2 - (h_1^2 + h_2^2)^{1/2}\}$.


Figure 5.8: A configuration for a clique in the rectangular pixel case.

The cost of a continuation of the kind shown in Figure 5.8 is calculated as in
Section 5A. Let $\theta_0 = \tan^{-1}(h_2/h_1)$. For $|\theta| < \theta_0$ the set of lines of inclination $\theta$
intersecting AD and BC forms a strip of width proportional to $\sin(\theta_0 - |\theta|)$, and the
projection length of AB and CD on a line of inclination $\theta$ is $h_1 \cos\theta$. Hence
arguments exactly analogous to those of Section 5A give as the cost of a
"continuation" as shown in Figure 5.8 the quantity

$h_1 \int_0^{\theta_0} \cos\theta \, \sin(\theta_0 - \theta)\, d\theta \Big/ \int_0^{\theta_0} \sin(\theta_0 - \theta)\, d\theta$

$= \tfrac12 h_1 \int_0^{\theta_0} \{\sin\theta_0 + \sin(\theta_0 - 2\theta)\}\, d\theta \Big/ \int_0^{\theta_0} \sin\theta' \, d\theta'$

$= \tfrac12 h_1 \theta_0 \sin\theta_0 / (1 - \cos\theta_0).$ (5.1)

The other type of continuation, consisting of two edges of length $h_2$, will cost an
amount obtained by substituting $h_2$ for $h_1$ and $\tfrac12\pi - \theta_0$ for $\theta_0$ in (5.1), viz.
$\tfrac12 h_2 (\tfrac12\pi - \theta_0) \cos\theta_0 / (1 - \sin\theta_0)$.
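
As a numerical check on (5.1), the ratio of integrals $h_1 \int_0^{\theta_0} \cos\theta \sin(\theta_0 - \theta)\,d\theta / \int_0^{\theta_0} \sin(\theta_0 - \theta)\,d\theta$ can be evaluated directly and compared with the closed form $\tfrac12 h_1 \theta_0 \sin\theta_0/(1 - \cos\theta_0)$. The sketch below is illustrative only; the function names and the midpoint rule are assumptions made for the example and are not part of the original text.

```python
import math


def continuation_cost_closed(h1, h2):
    """Closed form (5.1): 0.5 * h1 * theta0 * sin(theta0) / (1 - cos(theta0))."""
    theta0 = math.atan2(h2, h1)  # theta0 = arctan(h2 / h1)
    return 0.5 * h1 * theta0 * math.sin(theta0) / (1.0 - math.cos(theta0))


def continuation_cost_numeric(h1, h2, n=20000):
    """Midpoint-rule evaluation of the ratio of integrals that defines (5.1)."""
    theta0 = math.atan2(h2, h1)
    step = theta0 / n
    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        theta = (i + 0.5) * step
        weight = math.sin(theta0 - theta)  # strip width, up to a constant factor
        numerator += math.cos(theta) * weight * step
        denominator += weight * step
    return h1 * numerator / denominator


if __name__ == "__main__":
    for h1, h2 in [(1.0, 1.0), (2.0, 1.0), (1.0, 3.0)]:
        print(h1, h2,
              continuation_cost_closed(h1, h2),
              continuation_cost_numeric(h1, h2))
```

For the square case $h_1 = h_2 = h$ the closed form gives approximately $0.948h$.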

C.  Hexagonal  Pixels 

The  presence  of  a  single  type  of  first  order  neighbour  makes  the  use  of  hexagonal 
pixel  grids  appealing,  particularly  in  applications  such  as  tomography  where  physical 
properties  of  the  imaging  system  do  not  define  a  natural  pixel  grid.  The  dual  space  of 
a grid of regular hexagons with sides of length $l$ contains equilateral triangles of side
$\sqrt{3}\,l$. Applying the formulas of Section 4, the projection penalty and the edge length
penalty  for  a  continuation  in  the  dual  space  are  both  equal  to  J xl.  Since  only  one  form 
of  branching  is  possible,  the  region  counting  penalty  and  the  expected  projected  edge 
length  penalty  for  a  branch  will  always  appear  together  and  it  does  not  help  to  evaluate 
the  latter  quantity.  Thus,  the  penalties  for  cliques  of  type  0,  1,  2  and  3,  as  depicted  in 
Figure  4.2,  are  0,  \p,  J xl/3  and  \p  respectively. 

VI.  Conclusion 

Some  simple  geometrical  considerations  have  made  it  possible  to  define  edge 
process  penalties  which  are  approximately  invariant  to  the  scale  and  orientation  of  the 
pixel  grid,  and  which  can  in  addition  be  generalised  to  irregular  pixels.  The  general 
idea  of  evolving  penalties  based  on  a  conditional  expected  projection  length  has  the 
advantage  that  consistent  penalties  can  be  written  down  for  cliques  of  different  kinds 
that appear in different parts of the same pattern.

Flexible parsimonious smoothing and additive modeling.

1. Introduction

In  this  paper  we  shall  develop  an  approach  to  regression  fitting  based  on  an  extremely  simple 
idea. Consider first of all the univariate case where one has N pairs of measurements $(y_i, x_i)$,
$i = 1, \ldots, N$, and it is supposed that, as usual,

Y = f(X) + error (1)

where f is a function to be estimated, and the error is assumed to have zero mean; its distribution
may well depend on the value of X.

Regression,  or  curvefitting,  is  performed  for  a  number  of  reasons.  The  value  f(X)  is  the 
conditional  expectation  of  Y  given  the  value  X,  and  so  may  be  used  as  an  estimate  of  the  response 
Y  for  future  observations  where  only  the  value  of  the  predictor  variable  X  is  measured.  The 
function f can also be studied to try to gain insight into the predictive relationship between Y
and X. By far the most commonly used approach is, of course, linear regression. It is assumed -
rightly or wrongly - that f is a linear function f(X) = aX + b, and then the parameters a and b
are estimated by least squares.

What  should  be  done  if  the  data  are  not  well  approximated  by  a  straight  line  fit?  One  way 
forward is to allow f to be a piecewise linear function, made up of straight line pieces that join
together continuously at points called knots. If the knot positions are fixed before looking at the
data response values $y_i$, then, at the expense of introducing more parameters into the problem,
we will be able to fit a wider range of data sets reasonably well, while still including simple linear
regression as a special case. Furthermore all the necessary parameters can be found and inference
performed using standard linear regression methods (see Agarwal and Studden, 1980).

In  terms  of  flexibility,  much  greater  dividends  arise  if  the  knot  positions  are  not  fixed  in  advance, 
but  are  themselves  allowed  to  depend  on  the  data,  including  the  response  values.  In  this  case  an 
enormously wide range of models can be closely approximated using piecewise linear functions f
with  a  small  number  of  knots.  There  is  a  computational  penalty  to  be  paid,  because  some  sort 
of  search  procedure  needs  to  be  used  to  find  suitable  positions  for  the  knots.  In  this  paper,  we 
describe  a  stepwise  procedure  that  makes  it  feasible  to  fit  piecewise  linear  models  with  knot  positions 
determined  by  the  data,  and  we  also  discuss  practical  strategies  for  deciding  how  many  knots  to  use. 

One  of  the  attractive  features  of  our  method  is  that  it  can  very  easily  be  extended  to  the 
multivariate case. Suppose that the observations are of the form $(y_i, x_i)$ where each $x_i$ is now a
p-vector $(x_{1i}, x_{2i}, \ldots, x_{pi})$. It is assumed, as before, that the variable Y depends on X by a
relation of the form

$Y = f(X) + \text{error} = f(X_1, X_2, \ldots, X_p) + \text{error}.$

The way that we make use of our ideas about piecewise linear fitting is to concentrate on the case
where f is a sum of functions of the individual components of X,

$f(X) = f_1(X_1) + f_2(X_2) + \cdots + f_p(X_p).$ (2)

This approach is known as additive regression or additive modeling, and replaces the problem
of estimating a function f of a p-dimensional variable X by one of estimating p separate
one-dimensional functions $f_j$. Although not completely general, additive models are often effective:
they are easy to interpret, and represent a very important step beyond the simple linear model.

It turns out that our piecewise linear fitting method can be applied easily in the additive
modeling context. Each of the individual functions $f_j$ can be modeled as being piecewise linear
with knots that depend on the data, including the response values. Our stepwise fitting procedure
enables all the functions $f_j$ to be constructed together, at little more cost than for a univariate
problem.

The paper is set out as follows. In Section 2.0 we give a general discussion of smoothing
methods. We go on in Sections 2.1 and 2.2 to develop our approach in the univariate case.
Computational aspects are discussed in Section 2.3. The important question of model selection - how
many knots to use - is considered in Section 2.4. In Section 2.5 we provide a simple extension that
produces models with continuous first derivatives (if desired). In Section 3 we explain how the
additive modeling approach enables our method to be applied in the multivariate case, and in Section
4 we demonstrate how confidence intervals for the estimated function(s) can be obtained. Finally
in Section 5 a number of practical examples display the scope and power of our method as a
data-analytic tool.

2.0  Smoothing 

We first consider the case of a single predictor variable, p = 1. The smoothing problem has
been the subject of considerable study, especially in recent years. The lack of flexibility (ability to
closely approximate a wide variety of predictive relationships) associated with global fitting

$f_J(x) = a_0 + \sum_{j=1}^{J} a_j p_j(x)$ (3)

where the $p_j$ are predefined functions (usually involving increasing powers of x) has led to
developments in two general directions: piecewise polynomials and local averaging. The basic idea of
piecewise polynomials is to replace the single prescribed function $f_J(x)$ (of possibly high order J)
defined over the entire range of X values, with several generally low order polynomials, each defined
over a different subinterval of the range of X. The points that delineate the subintervals are called
knots. The greater flexibility of the piecewise polynomial approach is gained at some expense in
terms of local smoothness. The global function is generally taken to be continuous and have
continuous derivatives to all orders. Piecewise polynomials on the other hand are permitted to have
discontinuities in low order derivatives (and sometimes even the function itself) at the knots. The
tradeoff between smoothness and flexibility is controlled by the number of knots at which
discontinuities are permitted and the order of the lowest derivative allowed to be discontinuous. The most
popular piecewise polynomial fitting procedures are based on splines (de Boor, 1978). An M-spline
consists of piecewise polynomials of degree M constrained to be continuous and have continuous
derivatives through order M - 1. Smith (1982) presented an adaptable knot placement strategy
for spline fitting based on forward/backwards variable subset selection.

Local averaging smoothers directly use the fact that f(x) is intended to estimate a conditional
expectation, $E(Y \mid x)$. These estimates take the form

$f(x) = \sum_{i=1}^{N} H(x, x_i)\, y_i$ (4)

where $H(x, x')$ (called the kernel function) usually has its maximum value at $x' = x$ with its absolute
value decreasing as $|x' - x|$ increases. Therefore, f(x) is taken to be a weighted average of the $y_i$,
where the weights are larger for those observations that are close or local to x. A characteristic
quantity associated with a local averaging procedure is the local span s(x), defined to be the range
centered at x over which a given proportion of the averaging takes place,

$\int_{x - s(x)/2}^{x + s(x)/2} H(x, x')\, dx' = \alpha,$

with $\alpha$ a predefined constant fraction (e.g., $\alpha = 0.68$ or 0.95). If the defining property holds for
more than one value of s(x), then the smallest such value is taken. Many local averaging smoothers
take the span to be constant over the entire range of x, $s(x) = \lambda$ (Rosenblatt, 1971). Others
take it to be inversely proportional to the local density of x values, $s(x) = \lambda/p(x)$ (Cleveland,
1979). Smoothing splines (Reinsch, 1967) are in fact local averaging procedures where the span
turns out to be approximately $s(x) \approx \lambda/[p(x)]^{1/4}$ (see Silverman, 1984, 1985). (The quantity $\lambda$
represents a parameter of these procedures.) Recently, adaptable span local averaging smoothers
have been introduced that estimate optimal local span values based on the values of the responses
$y_i$ (Friedman and Stuetzle, 1982; Friedman, 1984). The span function s(x) controls the
continuity-flexibility tradeoff for local averaging smoothers. For the nonadaptable smoothers this is in turn
regulated by $\lambda$, the smoothing parameter of the procedure.
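
A local averaging estimate of the form (4) can be sketched in a few lines. The example below is a minimal illustration, not one of the smoothers cited above: it assumes a Gaussian kernel with a fixed width `bandwidth` (playing the role of a constant span parameter) and normalises the weights to sum to one. The names `kernel_smooth` and `bandwidth` are hypothetical.

```python
import math


def kernel_smooth(x0, xs, ys, bandwidth):
    """Weighted average of ys, with weights that decay as |x - x0| grows.

    Assumption: Gaussian weights with standard deviation `bandwidth`,
    normalised so that they sum to one.
    """
    weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, ys)) / total


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [0.0, 2.0, 4.0, 6.0, 8.0]
    # A small bandwidth follows the data closely; a large one averages widely.
    for bw in (0.3, 1.0, 5.0):
        print(bw, kernel_smooth(1.0, xs, ys, bw))
```

Increasing `bandwidth` trades local flexibility for smoothness, which is the tradeoff that the span controls in the discussion above.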

There is, of course, a connection between the piecewise polynomial and local averaging
approaches to smoothing. For a given knot placement, piecewise polynomial curve estimates can
also be expressed in the form given by (4) (as can global fits). There will be a characteristic local
span associated with the corresponding kernel. The more flexible the smoother is to local
variation, the smaller will be the span. The basic difference between the two approaches has to do with
how the span is specified. With local averaging smoothers the span parameter $\lambda$ usually enters
fundamentally into the definition of the kernel function (or some other aspect of the definition of the
smoother) and is either directly set by the user or some automated procedure (e.g. cross-validatory
choice) is employed for its selection. For piecewise polynomial smoothers it is indirectly regulated
by the choice of the number and placement of the knots, and the degree of continuity required at
the knot positions.

The trade-off between continuity and local flexibility is a fundamental one that directly affects
the statistical performance of the smoother as a curve estimator. If one assumes that there exists a
population from which the data can be regarded as a random sample, then the goal is to estimate
the conditional expectation $E(Y \mid X = x)$ for the population. Even if this is not the case the
goal is usually to obtain curve estimates f(x) that have good (future) prediction ability for new
observations not part of the training sample used to obtain the estimate.

Increased flexibility provides the smoothing procedure with an increased ability to fit the data
at hand more closely. This may or may not be good, depending on the extent to which this training
sample is representative of the population of future observations to be predicted. It is often the case
that fitting the training data too closely results in degraded estimates with poor future performance.
This phenomenon is called "over-fitting" and can be quantified through the bias-variance trade-off.
The (future) expected-squared-error can be expressed as

$E[f^*(x) - f(x)]^2 = [f^*(x) - E f(x)]^2 + \mathrm{Var}\, f(x)$ (5)

where $f^*(x) = E(Y \mid X = x)$ for the population (future observations). The expected values in (5)
are over repeated replications of the training sample. The first term on the right hand side of (5)
is the squared distance of the average (expected) curve estimate from the truth. It is referred to
as the "bias-squared" of the estimate. As the smoother is given more flexibility to fit the data,
the bias-squared generally decreases while the variance increases. Thus, for each situation there is
a (usually different) optimal flexibility. If a smoothing procedure is to provide good performance
over a wide variety of situations, it must be able to effectively adjust its flexibility-continuity
trade-off for each particular application.

Motivated by the work of Smith (1982), we present an adaptable piecewise polynomial
smoothing algorithm. It uses the data to automatically select the number and positions of the knots,
and to some extent the degree-of-continuity imposed at the knots as well. Although quite
simple, the method has both operational and performance characteristics that are quite similar to
the recently proposed adaptable span local averaging smoothers (Friedman and Stuetzle, 1981;
Friedman, 1984). It appears to have superior performance in low sample size and/or high noise
situations.

Our  focus  is  on  accurate  estimation  of  the  curve  itself  and  not  necessarily  its  derivatives.  We 
therefore  restrict  our  attention  to  low  order  polynomials  with  weak  continuity  requirements  at  the 
knots. This has the effect of minimizing the average effective span (see above) for a given number
of  knots.  This  is  important  if  accurate  solutions  with  a  small  number  of  knots  are  required.  This  will 
be  the  case  in  high  noise  small  sample  environments.  Our  simplest  method  employs  piecewise  linear 
fitting  where  only  the  function  itself  is  required  to  be  continuous.  We  also  describe  a  companion 
method  that  fits  with  piecewise  cubic  functions  where  continuous  first  -  but  not  second  -  derivatives 
are  imposed.  This  has  the  advantage  of  producing  curves  that  are  more  cosmetically  appealing,  if 
less  interpretable.  It  may  sometimes,  but  not  always,  produce  slightly  more  accurate  estimates  in 
situations  where  the  second  derivative  of  the  underlying  true  curve  is  nowhere  rapidly  varying. 

Our estimate of future prediction error - to be minimized - is based on the generalized
cross-validation measure (Craven and Wahba, 1979). A brief explanation of generalized cross-validation
(GCV) is given by Silverman (1985, Section 4.1). To explain GCV it is first necessary to mention
cross-validation (CV). Let K be the number of knots in the fitted model. The CV score is given by

$CV = \frac{1}{N} \sum_{i=1}^{N} [y_i - f_{-i}(x_i)]^2$

where $f_{-i}$ is the estimate calculated with the current values of the control parameters (in our case
the number of knots) from all the data points except the ith. The cross-validation score is then a
function of K, and gives a measure of future prediction error that may unfortunately be laborious
to calculate.

GCV can be thought of as an appropriate version of CV that has better computational
properties. For a suitable increasing function d(K) of the number of knots, the GCV score is defined
by

$GCV = \frac{1}{N} \sum_{i=1}^{N} [y_i - f(x_i)]^2 \Big/ \Big[1 - \frac{d(K)}{N}\Big]^2.$ (6)

If the knot placement values do not depend upon the sample response values $y_i$, then it can be
shown that an appropriate choice of d(K) is

$d(K) = \sum_{i=1}^{N} H(x_i, x_i)$

where H is the kernel function (4). For piecewise linear fitting by least squares with K knots, this
turns out to be d(K) = K + 1. It can be shown that this choice of d(K) makes GCV and CV
identical in certain special cases.
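
The GCV score (6) is simply the average squared residual inflated by the factor $[1 - d(K)/N]^{-2}$. A minimal sketch follows; the function names `asr` and `gcv` are hypothetical, and the caller is assumed to supply the fitted values and the value of d(K).

```python
def asr(ys, fitted):
    """Average squared residual: (1/N) * sum of (y_i - f(x_i))^2."""
    n = len(ys)
    return sum((y - f) ** 2 for y, f in zip(ys, fitted)) / n


def gcv(ys, fitted, d):
    """GCV score (6): ASR divided by (1 - d/N)^2, where d stands for d(K)."""
    n = len(ys)
    if d >= n:
        raise ValueError("d(K) must be smaller than the sample size N")
    return asr(ys, fitted) / (1.0 - d / n) ** 2


if __name__ == "__main__":
    ys = [1.0, 2.0, 3.0, 4.0]
    fitted = [1.0, 2.0, 3.0, 5.0]
    # With fixed knots the text gives d(K) = K + 1; here K = 1 is assumed.
    print(gcv(ys, fitted, d=2))
```

Because the penalty grows with d(K), a model with more knots must reduce the residual sum of squares by enough to offset the larger denominator correction.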

For adaptable span smoothers, such as those we introduce in the present paper, the
approximation is no longer good because of the additional flexibility given by the free choice of knot
positions. To compensate for this, we use (6) as an approximation with d(K) taken to be a more
rapidly increasing function of K; we discuss our choice of d(K) in Section 2.4 below.

2.1  Piecewise  linear  smoothing 

We describe first piecewise linear fitting. For a fixed number of knots K, we aim to place the
knots to give the minimum possible value of the average-squared-residual (ASR)

$ASR = \frac{1}{N} \sum_{i=1}^{N} [y_i - f(x_i)]^2$

for estimates f(x) chosen to be continuous and piecewise linear with the given knots. Given a set
of knot positions there are a number of ways to construct the corresponding piecewise linear fit that
minimizes the ASR. These involve choosing a set of basis functions $b_k(x)$, $1 \le k \le K$, parameterized
by the knot locations, that have the required continuity properties. The curve estimate is then
taken to be

$f(x) = a_0 + \sum_{k=1}^{K} a_k b_k(x).$ (7)

The values of the coefficients $a_0, \ldots, a_K$ corresponding to the piecewise linear curve that minimizes
the ASR are obtained by a (K + 1)-parameter linear least-squares fit of the response Y on the
basis function set $b_k(x)$.

There are a variety of basis function sets with the proper continuity properties for piecewise
linear fitting. The most convenient for our purposes is the set

$b_k(x) = (x - t_k)^{+}$ (8)

where $t_k$ is the location of the kth knot and the superscript indicates the nonnegative part. The
convenience of this basis stems from the fact that each basis function is parameterized by a single
knot. Thus, adding, deleting, or changing the position of a knot affects only one basis function.
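
For a given set of knots, the fit with the basis $(x - t_k)^{+}$ is an ordinary linear least-squares problem. The sketch below illustrates this under stated assumptions: it includes an explicit linear term x alongside the intercept and the hinge functions (equivalent to placing one knot at the smallest abscissa), and it solves the normal equations by plain Gaussian elimination. All function names are hypothetical, and this is not the updating scheme of Section 2.3.

```python
def hinge(x, t):
    """Nonnegative part (x - t)^+."""
    return x - t if x > t else 0.0


def solve(a, b):
    """Solve a small dense linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: check the knot locations")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = m[r][n] - sum(m[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = s / m[r][r]
    return sol


def fit_piecewise_linear(xs, ys, knots):
    """Least-squares coefficients for the columns 1, x and (x - t)^+ per knot t."""
    rows = [[1.0, x] + [hinge(x, t) for t in knots] for x in xs]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    return solve(xtx, xty)


def predict(coef, knots, x):
    """Evaluate the fitted continuous piecewise linear function at x."""
    return coef[0] + coef[1] * x + sum(c * hinge(x, t) for c, t in zip(coef[2:], knots))


if __name__ == "__main__":
    xs = [i / 10 for i in range(11)]
    ys = [1 + 2 * x + 3 * hinge(x, 0.5) for x in xs]  # slope changes from 2 to 5 at 0.5
    coef = fit_piecewise_linear(xs, ys, [0.5])
    print(coef, predict(coef, [0.5], 0.75))
```

Since each knot contributes exactly one column to the design matrix, moving a knot changes one column only, which is the property exploited later for fast updating.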

Optimizing the ASR over all possible (unequal) locations for the K knots is a fairly difficult
computational  task.  We  therefore  consider  the  subset  of  locations  defined  by  the  distinct  values 
realized  by  the  data  set.  This  has  the  effect  of  providing  more  potential  knot  locations,  and 
thus  more  potential  flexibility,  in  regions  of  higher  data  density  and  correspondingly  less  potential 
flexibility  in  sparser  regions.  This  attempts  to  control  the  variance,  since  regions  where  the  ratio 
of  data  points  to  knots  is  low  can  give  rise  to  locally  high  variance  in  the  curve  estimate. 

Even the (combinatorial) optimization of the ASR over this restricted set of locations is
formidable owing to the large number, N, of potential basis functions from which the optimizing
K must be chosen. We therefore adopt a stepwise strategy for knot placement. The first
knot (k = 1) is placed at the position that yields the best corresponding piecewise linear fit.
Thereafter, each additional knot is placed at the location that gives the best piecewise linear fit
involving it and the k - 1 knots that have already been placed. Knots are added in this manner
until some maximum number of knots ($K_{\max}$) have been positioned. This process yields a sequence
of $K_{\max}$ models, each one with one more knot than the previous one in the sequence. That model
in the sequence with smallest GCV as defined in equation (6) is chosen for further consideration.
The number, $K_{\max}$, of models to be considered should be chosen so that the model minimizing the
GCV is not too close to the end of the sequence. Owing to the forward stepwise nature of the
procedure, it is possible for the GCV sometimes to increase locally as the sequence proceeds and
then begin to decrease again. The bound $K_{\max}$ should be large enough so that the GCV associated
with the last model is substantially larger than the minimizing one in the sequence.
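
The forward stepwise placement just described can be sketched as a greedy search. The code below is a simplified illustration under several assumptions that are not taken from the text: it refits from scratch at every candidate location rather than using fast updating, it includes an explicit linear term in the basis, it restricts candidates to the interior distinct abscissa values, and it uses d(K) = K + 1 as a stand-in for the penalty function (the choice for adaptive knots is deferred to Section 2.4). All names are hypothetical.

```python
def hinge(x, t):
    """Nonnegative part (x - t)^+."""
    return x - t if x > t else 0.0


def solve(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        sol[r] = (m[r][n] - sum(m[r][c] * sol[c] for c in range(r + 1, n))) / m[r][r]
    return sol


def asr_for_knots(xs, ys, knots):
    """ASR of the least-squares fit on the columns 1, x and (x - t)^+ per knot."""
    rows = [[1.0, x] + [hinge(x, t) for t in knots] for x in xs]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    coef = solve(xtx, xty)
    resid = [y - sum(c * v for c, v in zip(coef, r)) for r, y in zip(rows, ys)]
    return sum(e * e for e in resid) / len(ys)


def forward_stepwise(xs, ys, k_max, d=lambda k: k + 1):
    """Greedy knot placement; returns a list of (knots, GCV) pairs, one per step."""
    candidates = sorted(set(xs))[1:-1]  # interior distinct values only
    n = len(ys)
    knots, path = [], []
    for _ in range(k_max):
        best = None
        for t in candidates:
            if t in knots:
                continue
            try:
                score = asr_for_knots(xs, ys, knots + [t])
            except ValueError:
                continue  # skip degenerate candidate locations
            if best is None or score < best[0]:
                best = (score, t)
        if best is None:
            break
        knots.append(best[1])
        path.append((list(knots), best[0] / (1.0 - d(len(knots)) / n) ** 2))
    return path


if __name__ == "__main__":
    xs = [i / 20 for i in range(21)]
    ys = [1 + 2 * x + 3 * hinge(x, 0.5) for x in xs]
    for knots, score in forward_stepwise(xs, ys, k_max=2):
        print(knots, score)
```

The model finally retained would be the entry of the returned path with the smallest GCV, which can then be passed to a backward deletion step.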

The model (with $K^*$ knots; $0 \le K^* \le K_{\max}$) that was found to minimize the GCV is next
subjected to a backwards stepwise deletion strategy. Each of its knots is in turn deleted and the
corresponding ($K^*$ - 1)-knot model is fitted. If any of these fits results in an improved GCV, the one
with the smallest is chosen, permanently deleting the corresponding knot. This procedure is then
repeated on the new ($K^*$ - 1)-knot model, deleting a knot if a better model is found. This continues
until the deletion of any remaining knot results in a curve with higher GCV.

This  knot  deletion  strategy  can  sometimes  result  in  an  improved  model  because  of  the  nature 
of  forward  stepwise  procedures.  The  first  few  knots  must  deal  with  the  global  nature  of  the  curve 
without the benefit of the additional knots that come later. They are, therefore, forced to ignore
the fine structure. Knots that are added later in order to model the fine structure can in aggregate
also account for the global structure, thereby causing the initial few knots to be redundant.

Knot deletion as described above seldom results in a dramatic improvement in GCV. It is worth
doing for the small to moderate improvement it sometimes provides, because it adds almost nothing
to the computational burden. All necessary calculations can be done using summary statistics
(basis covariance matrix and response covariance vector) already calculated for the original
$K^*$-knot model. No further passes over the data are required.

2.2  Minimum  span 

A natural strategy would be to make every distinct observation abscissa value a candidate
location for knot positioning. This would correspond to allowing the minimum local effective span
to include only a single observation. In low noise situations such a strategy can give reasonable
results. In high noise environments, however, this can lead to unacceptably high local variance.
A solution is to impose a minimum effective span by restricting the eligible knot locations. The
simplest implementation is to make every (distinct) Mth observation (in order of ascending x-value)
eligible for knot placement. This implementation also reduces computation by a factor of M in
the absence of ties.

A reasonable value for M, as a function of N, can be obtained by a simple coin tossing
argument. Suppose $y_i = f^*(x_i) + \varepsilon_i$, $1 \le i \le N$, where $\varepsilon_i$ is a mean zero random variable
with a symmetric distribution. Then $\varepsilon_i$ has an equal chance of being positive or negative. A
smoother will be resistant to a run of length L of either positive or negative errors so long as
its span in the region of the run is large compared to L. If not, the smoother will tend to follow
the run and hence incur increased (variance) error. A piecewise linear smoother can completely
respond to a run without degrading the fit in any other region (irrespective of the placement of
the other knots) if it can place three knots within its length. It can partially respond with two
knots in the run for an unfavorable placement of the other knots (i.e. one of them close to the
start or end of the run). This would suggest that the minimum knot increment M should satisfy
$M > L_{\max}/3$ (or $M > L_{\max}/2.5$ to be conservative) where $L_{\max}$ is the largest positive or negative
run to be expected in N binomial trials.

Let Pr(L) be the probability of observing a run of length L or longer in N tosses of a fair coin.
For small values of this probability a close upper bound is given by


Pn  L 


d 


(Bradley, 1968). One can choose a value $\alpha$ for this probability

$Pr(L) = \alpha$ (10)

(say $\alpha = 0.05$ or 0.01) and solve (9), (10) for the corresponding length $L(\alpha)$. Setting $M = L(\alpha)/2.5$
would (with probability $\alpha$) give resistance to a run of positive or negative error values. Solving (9),
(10) for $L(\alpha)$ would have to be done numerically. It turns out that the simple formula

$L(\alpha) = -\log_2\left[-\frac{1}{N}\ln(1 - \alpha)\right]$

approximates the solution quite closely (within a few percent) for $\alpha < 0.1$ and $N > 15$. This
suggests that a conservative increment for knot placement is given by

$M(N, \alpha) = -\log_2\left[-\frac{1}{N}\ln(1 - \alpha)\right]/2.5$ (11)

with $0.01 \le \alpha \le 0.05$.
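
Formula (11), $M(N, \alpha) = -\log_2[-\frac{1}{N}\ln(1 - \alpha)]/2.5$, is easy to tabulate. The sketch below evaluates it for a few sample sizes; the function names are hypothetical, and how the result is rounded to a whole number of observations is left to the user, since the text does not specify it.

```python
import math


def run_length(n, alpha):
    """Approximate run length L(alpha) = -log2( -(1/N) * ln(1 - alpha) )."""
    return -math.log2(-math.log(1.0 - alpha) / n)


def knot_increment(n, alpha=0.05):
    """Conservative knot increment M(N, alpha) = L(alpha) / 2.5, as in (11)."""
    return run_length(n, alpha) / 2.5


if __name__ == "__main__":
    for n in (20, 100, 1000):
        for alpha in (0.05, 0.01):
            print(n, alpha, round(knot_increment(n, alpha), 2))
```

The increment grows only logarithmically with N, so for N = 100 and $\alpha = 0.05$ it is a little over four observations.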

2.3  Computational  Considerations 

For each k > 0, at the kth step in the forward stepwise procedure described in Section 2.1 it is
necessary to optimize the position of the kth knot (over all eligible locations) given the positions of
the k - 1 previously placed knots. For a given knot placement increment M there are (in the absence
of ties) N/M - k + 1 eligible places to position the kth knot. (The positions of the k - 1 previously
placed knots are not eligible.) At each such potential new knot location a linear least-squares fit
must be performed to obtain the corresponding piecewise linear smooth and its associated ASR.
Thus approximately N/M linear least-squares fits must be computed to place each knot. If this
were implemented in a straightforward manner it would give rise to prohibitive computation in all
but the richest computing environments. Enormous computational gains can be realized, however,
by examining the set of eligible knot locations in a special order that permits the use of rapid
updating formulae associated with the basis (8). This strategy involves visiting the potential knot
positions in descending abscissa value and taking advantage of the fact that (for $t' > t''$)

$(x - t'')^{+} - (x - t')^{+} = \begin{cases} 0 & x \le t'' \\ x - t'' & t'' < x \le t' \\ t' - t'' & x > t' \end{cases}$ (12)

The linear least-squares fit for the kth knot (located at $t_k = t''$) can be accomplished by solving
the normal equations

$Ba = c$ (13)

where B is the k x k covariance matrix of the k basis functions (8),

$B_{jl} = \sum_{i=1}^{N} b_l(x_i)\,[b_j(x_i) - \bar b_j]$ (14a)

and c is the k-dimensional covariance vector of the response with each basis function

$c_j = \sum_{i=1}^{N} (y_i - \bar y)\, b_j(x_i).$ (14b)

Here $\bar b_j$ and $\bar y$ represent the averages of the corresponding quantities. The solution vector a =
$(a_1, \ldots, a_k)$ represents the coefficients corresponding to the optimizing piecewise linear fit (7) given
the knot locations $t_1, \ldots, t_k$. The ASR of the fit is then given by

$ASR = \mathrm{Var}(Y) - \sum_{j=1}^{k} a_j c_j / N.$ (14c)

Using (13), (14) as prescriptions for computing the corresponding quantities at each potential
knot location leads to the prohibitive computation mentioned above. The first thing to notice in
attempting to save computation is that only $c_k$ and $B_{jk}$, $1 \le j \le k$, need to be recomputed since
only the $k$th knot location is changing. (This reduces the computation by a factor of $k$.) The next
thing to note is that if these quantities have already been computed for a knot located at $t_k = t'$
then, from (12), a simple series of updates gives them for a knot located at $t_k = t''$ ($t'' < t'$). Let

$$s_0 = \sum_{x_i \ge t'} (y_i - \bar y), \qquad s_j = \sum_{x_i \ge t'} (b_j(x_i) - \bar b_j), \quad 1 \le j \le k-1,$$

$$u = \sum_{x_i \ge t'} 1, \qquad \text{and} \qquad v = \sum_{x_i \ge t'} x_i.$$

Then

$$c_k(t'') = c_k(t') + (t' - t'')\, s_0 + \sum_{t'' < x_i < t'} (x_i - t'')(y_i - \bar y),$$

$$B_{jk}(t'') = B_{jk}(t') + (t' - t'')\, s_j + \sum_{t'' < x_i < t'} (x_i - t'')(b_j(x_i) - \bar b_j), \quad 1 \le j \le k-1,$$

$$B_{kk}(t'') = B_{kk}(t') - (t'^2 - t''^2)\, u + 2(t' - t'')\, v + \sum_{t'' < x_i < t'} (x_i - t'')^2$$

gives the quantities that enter into the normal equations (13) for $t_k = t''$, given their values at
$t_k = t'$. All values are initialized to zero (i.e., $c_k(x_N) = B_{jk}(x_N) = 0$, $1 \le j \le k$).
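
The updates can be checked numerically against a from-scratch computation. The sketch below is an illustration under stated assumptions, not the authors' implementation: it assumes no tied abscissa values, uses a single previously entered basis function `b_prev`, and compares the last update against the uncentered sum of $b_k(x_i)^2$, which is the quantity that update tracks as written. The names `scan_knots`, `direct` and `step` are hypothetical.

```python
import numpy as np

def hinge(x, t):
    """One-sided basis function (x - t)_+."""
    return np.maximum(x - t, 0.0)

def scan_knots(x, y, b_prev, step=1):
    """Visit candidate knots in descending x, updating c_k, B_jk and sum b_k^2.

    b_prev holds the values b_j(x_i) of one previously entered basis function.
    Assumes no tied x values; `step` plays the role of the increment M.
    """
    order = np.argsort(x)[::-1]
    xs = x[order]
    yc = (y - y.mean())[order]              # y_i - ybar
    bc = (b_prev - b_prev.mean())[order]    # b_j(x_i) - bbar_j
    s0 = sj = u = v = 0.0                   # running sums over x_i >= t'
    c_k = b_jk = ss_k = 0.0                 # all zero at the largest abscissa
    folded = 0                              # points already in the running sums
    out = [(xs[0], c_k, b_jk, ss_k)]
    for i in range(step, len(xs), step):
        t_hi, t_lo = xs[i - step], xs[i]
        while folded <= i - step:           # add points with x_i >= t_hi
            s0 += yc[folded]
            sj += bc[folded]
            u += 1.0
            v += xs[folded]
            folded += 1
        mid = slice(i - step + 1, i)        # points strictly between the knots
        d = t_hi - t_lo
        dx = xs[mid] - t_lo
        c_k += d * s0 + np.sum(dx * yc[mid])
        b_jk += d * sj + np.sum(dx * bc[mid])
        ss_k += -(t_hi**2 - t_lo**2) * u + 2.0 * d * v + np.sum(dx**2)
        out.append((t_lo, c_k, b_jk, ss_k))
    return out

def direct(x, y, b_prev, t):
    """The same three quantities computed from scratch for a knot at t."""
    bk = hinge(x, t)
    return (np.sum((y - y.mean()) * bk),
            np.sum((b_prev - b_prev.mean()) * bk),
            np.sum(bk**2))

rng = np.random.default_rng(0)
x = rng.uniform(size=60)
y = np.sin(4 * x) + rng.normal(scale=0.2, size=60)
b_prev = hinge(x, 0.3)
updated = scan_knots(x, y, b_prev, step=3)
```

Each observation is folded into the running sums once and appears in one "between the knots" sum, so the scan over all candidate locations costs a single pass over the data per basis function.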
These updating formulae provide the ingredients for the normal equations (13) at all potential
knot locations with total computation of order $kN$. It remains to solve the normal equations at
the (approximately $N/M$) eligible locations for knot placement. This can be done most rapidly by
using the Cholesky decomposition of $B$ followed by back-substitution (see Golub and Van Loan,
1983). Since only the last row and column of $B$ are changing, its Cholesky decomposition can be
updated with $k^2$ computations (Golub and Van Loan, 1983). The back-substitution can also be
performed in $k^2$ computations. Therefore the dominant part of the computation for optimizing the
ASR with respect to the position of the $k$th knot is of order $k^2 N/M$. The computation associated
with a single linear least-squares fit is of order $k^2 N$. Therefore, the updating strategy permits the
implicit evaluation of $N/M$ linear least-squares fits with less computation than a single such fit.
The entire procedure for placing all $K_{\max}$ knots in the forward stepwise procedure requires roughly
the same computation as $K_{\max}/3$ linear least-squares fits with $K_{\max}$ variables.
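
The Cholesky step can be sketched as follows. This is a minimal illustration of the standard bordering update, assuming the factor of the leading $(k-1) \times (k-1)$ block is kept fixed while the last row and column change; the helper names are hypothetical and the triangular solves are written out by hand so the $k^2$ cost is visible.

```python
import numpy as np

def forward_sub(L, b):
    """Solve L w = b for lower-triangular L in O(k^2) operations."""
    w = np.zeros(len(b))
    for i in range(len(b)):
        w[i] = (b[i] - L[i, :i] @ w[:i]) / L[i, i]
    return w

def back_sub(U, b):
    """Solve U a = b for upper-triangular U in O(k^2) operations."""
    n = len(b)
    a = np.zeros(n)
    for i in range(n - 1, -1, -1):
        a[i] = (b[i] - U[i, i + 1:] @ a[i + 1:]) / U[i, i]
    return a

def extend_cholesky(L_prev, col, diag):
    """Cholesky factor of B when only its last row and column are new.

    L_prev factors the leading block; col and diag are the off-diagonal
    part and the diagonal entry of the last column of B.
    """
    k = L_prev.shape[0] + 1
    w = forward_sub(L_prev, col)
    L = np.zeros((k, k))
    L[:k - 1, :k - 1] = L_prev
    L[k - 1, :k - 1] = w
    L[k - 1, k - 1] = np.sqrt(diag - w @ w)
    return L

def solve_normal(L, c):
    """Solve B a = c given B = L L^T."""
    return back_sub(L.T, forward_sub(L, c))

rng = np.random.default_rng(5)
A = rng.normal(size=(20, 4))
B = A.T @ A                                  # a positive definite test matrix
c = rng.normal(size=4)
L3 = np.linalg.cholesky(B[:3, :3])           # factor for the fixed knots
L4 = extend_cholesky(L3, B[:3, 3], B[3, 3])  # refreshed for a new trial knot
a = solve_normal(L4, c)
```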

The computational strategy outlined above emphasizes speed over numerical stability. First of
all, the one-sided basis (8) is known to have poor numerical properties compared to other possible
representations of piecewise linear functions (de Boor, 1978). Its advantage lies in the fact
that each basis function is characterized by a single knot. This leads to the simple and rapidly
computable updating formulae derived above. A second compromise is the choice of the normal
equations with the Cholesky decomposition of the basis covariance matrix to perform each linear
least-squares fit. It is well known that using the QR decomposition of the basis "data" matrix would
provide superior numerical properties (see Golub and Van Loan, 1983). Unfortunately, updating
the QR decomposition requires computation proportional to $kN$ (compared to $k^2$ for the Cholesky
strategy), which would cause the total computation to be proportional to $N^2$.

Potential numerical difficulties associated with this particular strategy are mitigated by two
factors. First, the minimal span requirement (11) limits somewhat the correlation between basis
functions (8) associated with adjacent knots. Second, for sample sizes that are not extremely large,
the number of knots is generally quite small, keeping the size of the associated least-squares problem
small. Numerical problems tend only to arise when this strategy is applied to very large problems
(typically $N > 500$) for which the resulting solution is a very complex curve requiring a great many
knots. For these cases numerical stability can be achieved by slightly deoptimizing the least-squares
fit (13) at each potential location for the $k$th knot. The basis coefficients $a = (a_1, \ldots, a_k)$ of the
piecewise linear fit are taken to be the solution to

$$(B + \epsilon I)\, a = c,$$

with $I$ being the $k \times k$ identity matrix, and the value of $\epsilon$ chosen to be just large enough to
maintain numerical stability. Although these coefficient values can be somewhat different from
those produced by (13) in highly collinear settings, they produce nearly identical curve estimates
(7). The criterion used to select the best knot location is still the ASR. Typically, taking

$$\epsilon = 10^{-5}\, \mathrm{trace}(B)/k$$

maintains stable computation while having very little effect on the resulting curve estimate.
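
A minimal sketch of this stabilized solve is given below. The two nearly coincident knots are a contrived example chosen for this illustration, and `stabilized_coefficients` is a hypothetical name; a general dense solver stands in for the Cholesky route described above.

```python
import numpy as np

def stabilized_coefficients(B, c, factor=1e-5):
    """Solve (B + eps I) a = c with eps = factor * trace(B) / k."""
    B = np.asarray(B, dtype=float)
    k = B.shape[0]
    eps = factor * np.trace(B) / k
    return np.linalg.solve(B + eps * np.eye(k), c), eps

# Two nearly collinear basis functions (knots very close together).
x = np.linspace(0.0, 1.0, 200)
basis = np.column_stack([np.maximum(x - 0.500, 0.0),
                         np.maximum(x - 0.501, 0.0)])
y = np.maximum(x - 0.5, 0.0)                 # noise-free response
centered = basis - basis.mean(axis=0)
B = centered.T @ basis                       # covariance matrix as in (14a)
c = basis.T @ (y - y.mean())                 # covariance vector as in (14b)
a, eps = stabilized_coefficients(B, c)
fit = centered @ a                           # centered curve estimate
```

The coefficients `a` may differ noticeably from the unpenalized solution in this collinear setting, while `fit` stays close to the centered response.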

2.4  Model  Selection 

In order to implement the forwards/backwards stepwise knot placement strategy described in
Section 2.1 it is necessary to have an estimate of the future prediction error. For procedures that
are linear in the responses (4) a variety of estimators (model selection criteria) have been proposed
(Akaike, 1970; Mallows, 1973; Craven and Wahba, 1979; Shibata, 1980; Breiman and Freedman,
1983). For a given knot placement (fixed set of regression variables) our method is linear in the
responses. However, we use the response values to determine where to place the knots. As a
result our curve estimator is not linear in the responses ($H(x, x_i)$ depends upon $y_1, \ldots, y_N$). There is
increased variance in the curve estimates corresponding to the variability associated with the knot
placement that is not incorporated into the above criteria. For nonlinear procedures, techniques
based on sample reuse (cross-validation, Stone, 1974, and bootstrap, Efron, 1983) are appropriate.
These require considerable computation, however, and a common practice is simply to ignore the
increased variability associated with model selection. If the number of selected variables is not very
much smaller than the size of the initial set, the increased variance is not large, and such a strategy
may be effective. In our situation, however, this is not the case. We intend to select a few knots,
usually from a very large number of potential locations.

The basis for our model selection strategy lies in the work of Hinkley (1969, 1970) and Feder
(1975). They consider the problem of testing the hypothesis that a two-segment piecewise linear
regression function in fact consists of only a single segment, in the presence of normal homoscedastic
errors. Specifically, it is assumed that

$$Y_i = a + b X_i + c\,(X_i - t)_+ + \epsilon_i \qquad (15)$$

with $\epsilon_i \sim N(0, \sigma^2)$, and one wishes to test the hypothesis that $c = 0$. If the knot location $t$ is
specified in advance then (under the null hypothesis $H_0: c = 0$) the difference between the (scaled)
residual sums of squares from the respective two and three parameter least-squares fits follows a
chi-squared distribution on one degree-of-freedom, $\chi^2_1$. That is, the additional parameter, $c$, uses
one additional degree-of-freedom.

When one adjusts the knot location $t$, as well as the coefficient $c$, then this is no longer the
case. Furthermore, under the condition $c = 0$ the parameter $t$ is not identifiable, and so we
cannot use the usual asymptotic theory and just add a degree-of-freedom for the additional fitted
parameter $t$. Feder (1975) shows that (under $H_0: c = 0$) the difference between the residual
sum-of-squares from the respective two and four parameter fits asymptotically follows the distribution of
the maximum of a large number of correlated $\chi^2_1$ and $\chi^2_2$ random variables. Furthermore, the precise
correlational structure (and thus the distribution) depends on the spacings of the observations. Such
a distribution will give rise to considerably larger test statistic values than $\chi^2_1$ and generally larger
values than even $\chi^2_2$. That is, the additional parameter $t$ uses more than one additional
degree-of-freedom. Hinkley (1969, 1970) reports strong empirical evidence that the distribution closely
follows a chi-squared on three degrees-of-freedom. Thus, fitting both the additional coefficient, $c$,
and the corresponding knot location, $t$, uses about three additional degrees-of-freedom.
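
The effect can be seen in a small simulation under the null hypothesis. The sketch below is an illustration only: it assumes $\sigma = 1$ is known (so no scaling is needed), restricts the knot to the interior data points, and uses arbitrary choices of sample size and number of replications; the function names are hypothetical.

```python
import numpy as np

def rss(design, y):
    """Residual sum of squares of the least-squares fit of y on design."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)

def knot_search_statistic(x, y):
    """RSS of the straight line minus the smallest RSS over one-knot fits."""
    ones = np.ones_like(x)
    base = rss(np.column_stack([ones, x]), y)
    best = base
    for t in np.sort(x)[1:-1]:                       # interior candidate knots
        hinge = np.maximum(x - t, 0.0)
        best = min(best, rss(np.column_stack([ones, x, hinge]), y))
    return base - best

rng = np.random.default_rng(1)
n, reps = 50, 200
x = np.sort(rng.uniform(size=n))
stats = np.array([knot_search_statistic(x, rng.normal(size=n))
                  for _ in range(reps)])
mean_stat = stats.mean()   # a knot fixed in advance would give a mean near 1
print(round(mean_stat, 2))
```

With the knot fixed in advance the statistic would average about one; searching over knot locations pushes the average above that.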

A similar effect was reported by Hastie and Tibshirani (1985) in the context of projection
pursuit regression (Friedman and Stuetzle, 1981). Here the model is

$$y_i = g\Big(\sum_{j=1}^{p} a_j x_{ij}\Big) + \epsilon_i$$

with $\epsilon_i \sim N(0, \sigma^2)$, and $g$ is a smooth function whose argument is a linear combination of the $p$
predictor variables. The objective is to minimize the residual sum of squares jointly with respect
to the parameters defining both the function and the linear combination in its argument. The
null hypothesis $H_0$ is that $g$ is a constant function. Hastie and Tibshirani (1985) performed a
simulation experiment to obtain the distribution of the scaled difference of the residual sum of
squares as a function of the number of parameters associated with the function $g$, for $p = 5$ and
$N = 360$. They found that the expected value of this distribution was always greater than the sum
of the number of parameters associated with both the curve and the linear combination (except
for the degenerate case of $g$ linear). This effect became more pronounced as more parameters were
associated with $g$. These results, together with those of Hinkley (1969, 1970) and Feder (1975),
indicate that the number of degrees-of-freedom associated with nonlinear least-squares regression
can be considerably more than the number of parameters involved in the fit.

Our knot placement strategy does not perform an unrestricted minimization, but rather
minimizes the ASR over a restricted set of eligible knot locations. In the absence of a large number of
ties, however, the solution value for the ASR is not likely to be a great deal different. Thus,
following Hinkley (1969, 1970) and associating a loss of three degrees-of-freedom for each knot adaptively
placed (with our strategy) seems reasonable, if a bit conservative. We therefore use

$$d(K) = 3K + 1, \qquad (16)$$

in conjunction with the generalized cross-validation estimate of future prediction error (6), as a
model selection criterion to be minimized.
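
As an illustration of how (16) enters model selection, the sketch below assumes that criterion (6), which is defined earlier and not reproduced in this section, has the usual generalized cross-validation form $\mathrm{ASR}/[1 - d(K)/N]^2$. The ASR values are made up for the example.

```python
def degrees_of_freedom(n_knots):
    """Cost d(K) = 3K + 1 charged for K adaptively placed knots, as in (16)."""
    return 3 * n_knots + 1

def gcv(asr, n_knots, n_obs):
    """GCV-style criterion ASR / (1 - d(K)/N)^2 (assumed form of (6))."""
    d = degrees_of_freedom(n_knots)
    if d >= n_obs:
        return float("inf")
    return asr / (1.0 - d / n_obs) ** 2

# Hypothetical ASR values for fits with 0..4 knots on N = 100 observations.
asr_by_knots = [1.00, 0.70, 0.55, 0.53, 0.52]
scores = [gcv(a, k, 100) for k, a in enumerate(asr_by_knots)]
best_k = min(range(len(scores)), key=scores.__getitem__)
```

Here the third and fourth knots reduce the ASR slightly but not enough to pay for three degrees-of-freedom each, so two knots minimize the criterion.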

2.5  Piecewise  cubic  fitting 

Continuous  piecewise  linear  curves  provide  maximum  flexibility  for  a  given  (small)  number  of 
knots.  They  also  have  the  advantage  of  ready  interpretation:  linear  relationship  within  subintervals 
of  the  range  of  X.  Their  principal  disadvantage  is  the  discontinuity  of  the  first  derivative  (infinite 
second  derivative)  at  each  knot  location.  This  causes  the  curve  to  be  cosmetically  unappealing  to 
some. 

Also, if the true underlying function $f^*(x)$ (5) does not have a locally high second derivative
close to a knot location, then a piecewise linear approximation will exhibit a small increased error
in the neighborhood near that knot. (This is in contrast to the corresponding first, and especially,
the second derivative estimates which contain much larger errors.) If the second derivative of $f^*(x)$
is  everywhere  slowly  varying  then  (slightly)  more  accurate  curve  estimates  can  be  obtained  by 
restricting  the  variation  of  the  second  derivative.  This  is  at  the  expense  of  reduced  flexibility  to  fit 
curves  that  do  have  locally  rapidly  varying  second  derivatives. 

The same considerations (see Section 2.0) that led to the desirability of piecewise linear
approximations guide our approach to piecewise cubic fitting. We seek a curve estimate whose
function and first derivative values are everywhere continuous. Under that constraint we would like an
estimate that closely resembles the corresponding piecewise linear fit. In particular, we do not wish
to require, in addition, everywhere continuous second derivatives.

A simple modification of our basis functions (8) (used for piecewise linear fitting) leads to an
appropriate basis for the corresponding piecewise cubic approximation:

$$B_k(x) = \begin{cases} 0 & x \le t_{k-} \\ q_k (x - t_{k-})^2 + r_k (x - t_{k-})^3 & t_{k-} < x < t_{k+} \\ x - t_k & t_{k+} \le x \end{cases} \qquad (17)$$

with $t_{k-} < t_k < t_{k+}$.

Setting the coefficients $q_k$ and $r_k$ to

$$q_k = (2t_{k+} + t_{k-} - 3t_k)/(t_{k+} - t_{k-})^2$$

$$r_k = (2t_k - t_{k+} - t_{k-})/(t_{k+} - t_{k-})^3 \qquad (18)$$
causes $B_k(x)$ (17) to be everywhere continuous and have continuous first derivatives. Outside the
interval $t_{k-} < x < t_{k+}$, $B_k(x)$ is identical to the corresponding piecewise linear basis function
$b_k(x)$ (8) with a knot at $t_k$. Inside the interval $B_k(x)$ is a cubic function whose average first
and second derivatives (over the interval) match those for the corresponding $b_k(x)$. The second
derivatives of $B_k(x)$ exhibit discontinuities at $t_{k+}$ and $t_{k-}$. Far from the central knot location
$t_k$, $B_k(x)$ has the same properties as $b_k(x)$, so that both bases will have similar characteristic spans
(see Section 2.0). Close to the central knot (inside $[t_{k-}, t_{k+}]$) $B_k(x)$ is an approximation to $b_k(x)$
with continuous first derivative.
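
The continuity claims are easy to verify numerically. The following sketch evaluates (17) with the coefficients (18); the function names and the example knot values are assumptions made for illustration.

```python
import numpy as np

def cubic_coefficients(t_minus, t, t_plus):
    """Coefficients q_k and r_k from (18)."""
    h = t_plus - t_minus
    q = (2.0 * t_plus + t_minus - 3.0 * t) / h**2
    r = (2.0 * t - t_plus - t_minus) / h**3
    return q, r

def cubic_basis(x, t_minus, t, t_plus):
    """Piecewise cubic basis function B_k(x) from (17)."""
    x = np.asarray(x, dtype=float)
    q, r = cubic_coefficients(t_minus, t, t_plus)
    z = x - t_minus
    inside = q * z**2 + r * z**3
    return np.where(x <= t_minus, 0.0,
                    np.where(x >= t_plus, x - t, inside))

t_minus, t, t_plus = 0.0, 0.4, 1.0
grid = np.linspace(-0.5, 1.5, 9)
values = cubic_basis(grid, t_minus, t, t_plus)
```

For these knots $q_k = 0.8$ and $r_k = -0.2$, so at $x = t_{k+} = 1$ the cubic takes the value $0.6 = t_{k+} - t_k$ with slope $1$, matching the linear piece.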

Knot placement based on piecewise linear fitting (Sections 2.1 - 2.4) is used to select knot
locations for piecewise cubic fits. The resulting knot locations $t_1, \ldots, t_K$ are used as the central
knots for the cubic basis $B_1(x), \ldots, B_K(x)$ (17). The side knots $\{t_{k-}, t_{k+}\}$, $1 \le k \le K$, are placed at
the midpoints between the central knots. Let $t_{(1)}, \ldots, t_{(K)}$ be the central knots in ascending abscissa
value. Then

$$t_{(k)-} = (t_{(k)} + t_{(k-1)})/2$$

$$t_{(k)+} = (t_{(k)} + t_{(k+1)})/2 \qquad (19)$$

for $2 \le k \le K-1$. The extreme knot locations $t_{(1)+}$ and $t_{(K)-}$ are defined as in (19). The outer side
knots are defined by

$$t_{(1)-} = (t_{(1)} + x_{(1)})/2$$

$$t_{(K)+} = (t_{(K)} + x_{(N)})/2 \qquad (20)$$

where $x_{(1)}$ and $x_{(N)}$ are respectively the lowest and highest sample abscissa values. If the knot
placement procedure happens to put a knot at $x_{(1)}$ (pure linear term in the model) then the
corresponding basis function is taken to be $B_{(1)}(x) = x - x_{(1)}$.
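
The side-knot rules (19) and (20) amount to taking midpoints, as the following sketch shows. The function name `side_knots` is hypothetical, and the special case of a knot placed exactly at $x_{(1)}$ is not handled here.

```python
def side_knots(central, x_min, x_max):
    """Side knots from (19) and (20) for central knots on one variable.

    Returns a list of (t_minus, t, t_plus) triples in ascending order of t.
    """
    t = sorted(central)
    lower = [(t[0] + x_min) / 2.0] + \
            [(t[k] + t[k - 1]) / 2.0 for k in range(1, len(t))]
    upper = [(t[k] + t[k + 1]) / 2.0 for k in range(len(t) - 1)] + \
            [(t[-1] + x_max) / 2.0]
    return list(zip(lower, t, upper))

# Two central knots on data whose abscissa values span [0, 1].
knots = side_knots([0.6, 0.2], x_min=0.0, x_max=1.0)
```

Adjacent central knots share a side knot: the upper side knot of the first triple equals the lower side knot of the second.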

The piecewise cubic curve estimate

$$\hat f(x) = a_0 + \sum_{k=1}^{K} a_k B_k(x) \qquad (21)$$

is obtained by minimizing the ASR with respect to the coefficients $a_0, \ldots, a_K$. In the interior,
$t_{(1)-} < x < t_{(K)+}$, it is piecewise cubic with second derivative discontinuities at the midpoints
between the central knots, $t_{(k)+} = t_{(k+1)-}$, $1 \le k \le K-1$. In the outer regions, $x \le t_{(1)-}$
or $x \ge t_{(K)+}$, the curve estimate is taken to be linear. This helps to control the high variance
associated with the extremes of the interval ("end effects").
Although the piecewise cubic fit seldom provides a dramatic improvement, it requires very little
computation (one additional linear least-squares fit) beyond that required for the (piecewise linear)
knot placement. One can compare the GCV (6) (16) (equivalently, the ASR) for the piecewise
linear and cubic estimates, choosing the one that is best. If a strong prejudice exists for continuous
first derivatives, then one might prefer the cubic estimate even if it provides a slightly poorer fit to
the data.

3.0  Additive  modeling 

The simplest extension of smoothing to the case of multiple predictor variables, $X_1, \ldots, X_p$, is the
additive model (2). Flexible additive regression has been the focus of considerable recent interest.
It is a special case of the projection pursuit regression model ("projection selection", Friedman and
Stuetzle, 1981). It also represents special cases of the ACE (Breiman and Friedman, 1985) and
generalized additive models (Hastie and Tibshirani, 1984, 1986). Stone and Koo (1985) suggest
additive modeling based on a central cubic spline approximation, with linear approximation past
the extremes, and nonadaptive knot placement.

The smoothing procedure described in the previous section has a natural extension to multiple
predictor variables. The piecewise linear basis functions analogous to (8) become

$$b_k(x) = (x_{j(k)} - t_k)_+ \qquad (22)$$

where $k$, $1 \le k \le K$, labels the knots and $j(k)$, $1 \le j(k) \le p$, labels a predictor variable
corresponding to each knot. Each knot location $t_k$ is associated with a particular predictor variable,
$j(k)$, and all of the predictor variables provide eligible locations for knot placement. Additive
modeling in this context can simply be regarded as a (univariate) smoothing problem with a
larger number ($pN$ versus $N$) of ordinate-abscissa pairs. The forward/backward knot placement
strategy, minimum span (with $pN$ replacing $N$), and model selection criteria directly apply, as
do the updating formulae derived in Section 2.3 (reinitialized to zero for each new variable). The
resulting piecewise linear model

$$\hat f(x) = a_0 + \sum_{k=1}^{K} a_k (x_{j(k)} - t_k)_+ \qquad (23)$$

can be cast into the form given by (2) with

$$\hat f_i(x_i) = \sum_{j(k) = i} a_k (x_i - t_k)_+, \quad 1 \le i \le p. \qquad (24)$$
Note that the means of the individual (predictor) variable functions (24) can be considered arbitrary
for purposes of interpretation.
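
The regrouping of the knots by variable can be sketched as follows. The array layout, the 0-based variable labels and the example coefficients are assumptions for this illustration.

```python
import numpy as np

def variable_functions(X, a, t, j):
    """Evaluate f_i(x_i) = sum over knots with j(k) = i of a_k (x_i - t_k)_+.

    X is (N, p); a, t, j are length-K sequences of coefficients, knot
    locations and 0-based variable labels. Returns an (N, p) array whose
    columns are the individual variable functions.
    """
    X = np.asarray(X, dtype=float)
    F = np.zeros_like(X)
    for a_k, t_k, j_k in zip(a, t, j):
        F[:, j_k] += a_k * np.maximum(X[:, j_k] - t_k, 0.0)
    return F

X = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]])
a, t, j = [2.0, -1.0, 0.5], [0.0, 1.0, 1.5], [0, 0, 1]   # hypothetical fit
a0 = 1.0
F = variable_functions(X, a, t, j)
fitted = a0 + F.sum(axis=1)      # the additive piecewise linear model
```

Each column of `F` can be plotted against its own predictor; shifting a column by a constant and absorbing the shift into `a0` leaves `fitted` unchanged, which is the sense in which the means are arbitrary.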

The corresponding piecewise cubic basis (17) is constructed in a manner analogous to that
for the smoothing problem ($p = 1$). The only difference is that the side knots $t_{(k)\pm}$ (19) are
positioned at the midpoints between the central knots ($t_k$) defined on the same variable. The end
knots (20) are positioned using the corresponding endpoints on the same variable. The resulting
basis functions $B_k(x_{j(k)})$ define individual variable functions analogously to (24)

$$\hat f_i(x_i) = \sum_{j(k) = i} a_k B_k(x_i), \quad 1 \le i \le p, \qquad (25)$$

again  with  arbitrary  means. 

Although exceedingly simple, this method of additive modeling has some powerful
characteristics. The knot placement strategy considers each potential knot location in conjunction with all
existing knots on all the predictor variables (not just those defined on the same variable) when
deciding whether to add (or delete) a particular knot. At each point the forward stepwise strategy
decides (in a natural way) whether to increase the flexibility of an already existing variable curve
(24) (25) or whether to add another variable, either linearly or nonlinearly. Variable subset
selection thereby occurs as a natural byproduct of this approach. Note that the smallest abscissa value
on each predictor variable is always made eligible for knot placement (irrespective of the minimum
span value, Section 2.2) so that any predictor variable can potentially enter in a purely linear way.

The  additive  modeling  strategy  outlined  above  places  no  special  emphasis  on  linearity.  A 
purely  linear  relationship  in  any  variable  is  represented  by  one  of  the  eligible  knot  locations  (the 
first)  on  that  variable.  One  can  (if  desired)  place  such  special  emphasis  by  requiring  that  the  first 
knot  entered  for  each  variable  be  at  its  smallest  value.  The  price  paid  for  this  is  increased  variance 
in  estimating  some  monotone  relationships  and  dramatically  increased  bias  against  non-monotone 
relationships. 

Our  strategy  does,  however,  place  some  special  emphasis  on  monotonicity.  Monotone  trends 
will  enter  before  somewhat  stronger  highly  nonmonotone  relationships.  Also,  there  is  a  slight 
preference  for  certain  types  of  monotone  trends,  namely  those  that  start  with  a  small  slope.  These 
can  be  approximated  with  a  single  knot  as  can  a  purely  linear  trend. 

This method of additive modeling is invariant to the locations and individual spreads of the
variables. Translating or rescaling each of the variables by a (different) constant factor will, in
principle, not affect the solution. If, however, the predictor variables have very large absolute
locations (compared to their scales) and/or wildly different scales, there can be undesirable numerical
consequences associated with the updating and least-squares fitting. In such cases (as with
ordinary linear least-squares regression) it is wise to center and/or rescale the predictor variables to
remove the large locations and/or wild scale differences before applying the modeling procedure.
The resulting solution is easily transformed back to the original variable locations and scales.
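
The back-transformation can be sketched for the piecewise linear basis. The example relies on the identity $a\,(z - t)_+ = (a/s)\,(x - (m + s\,t))_+$ for $z = (x - m)/s$ with $s > 0$; the function names and the "fitted" values are hypothetical.

```python
import numpy as np

def standardize(X):
    """Center and scale each predictor column; returns Z, locations, scales."""
    X = np.asarray(X, dtype=float)
    loc = X.mean(axis=0)
    scale = X.std(axis=0)
    return (X - loc) / scale, loc, scale

def to_original_units(a_z, t_z, j, loc, scale):
    """Map coefficients and knots fitted on standardized data back to X units."""
    j = np.asarray(j)
    a = np.asarray(a_z, dtype=float) / scale[j]
    t = loc[j] + scale[j] * np.asarray(t_z, dtype=float)
    return a, t

rng = np.random.default_rng(2)
# One predictor with a huge location, one with a tiny scale.
X = np.column_stack([1e6 + rng.normal(size=30), 1e-3 * rng.normal(size=30)])
Z, loc, scale = standardize(X)
a_z, t_z, j = [1.5, -0.7], [0.2, -0.4], [0, 1]     # hypothetical fitted values
a, t = to_original_units(a_z, t_z, j, loc, scale)
```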

4.0  Confidence  intervals 

When attempting to interpret the individual predictor variable curve estimates, it is important
to have a notion of how far the estimate is likely to deviate from the true underlying (population)
conditional expectation. This can be quantified by the expected squared error

$$E[f_i^*(x_i) - \hat f_i(x_i)]^2 = (f_i^*(x_i) - E\hat f_i(x_i))^2 + \mathrm{Var}\,\hat f_i(x_i). \qquad (26)$$

Here $f_i^*(x_i)$ is the true population curve and $\hat f_i(x_i)$ is the estimate from the sample. The
expected values in (26) are over repeated samples of size $N$ drawn from the population distribution.
For linear (nonadaptable) procedures (knots fixed in advance) and homoscedastic errors (1), one
can estimate the variance term in (26) through standard formulae for the covariances of the $a_k$
appearing in (24) and (25) and an estimate of the true underlying error variance, $\sigma^2$. With adaptable
procedures such as ours this can be highly overoptimistic because it does not account for the
variability associated with the knot placement.

One way to mitigate this effect is to inflate $\hat\sigma^2$ to account for the additional degrees-of-freedom
used by the adaptive knot placement (total of three for each knot). Even this, however, does
not give completely satisfactory results. For example, the (constant) predictor variable curves
associated with no knots would be calculated to have zero variance. This is clearly not the case. In
addition, there is seldom reason to expect homoscedasticity. Even if one could accurately estimate
the variance it is, in any case, only one part of the expected squared error. There is still the unknown
and potentially large bias-squared term in (26).

Bootstrapping (see Efron and Tibshirani, 1986) provides a means of estimating the variance of
the curve estimates (assuming only independence) and can give some indication of the bias as well.
This is, of course, at the expense of additional computing. However, the additive modeling
procedure described here is generally fast enough (see Section 2.3) to permit substantial bootstrapping,
and honest uncertainty estimates are usually worth it.

The basic idea underlying the bootstrap is to substitute the sample for the population and
study the behavior of estimates under repeated samples of size $N$ drawn from it. In particular, we
can estimate the expected squared error (26) by

$$E[f_i^*(x_i) - \hat f_i(x_i)]^2 = E_B[\hat f_i(x_i) - \hat f_i^{(B)}(x_i)]^2. \qquad (27)$$



Here $E_B$ is the expected value over repeated "bootstrap" samples of size $N$ drawn (with
replacement) from the data, and $\hat f_i^{(B)}$ is the ($i$th) curve estimate for the bootstrap samples. In fact, one
can approximate the distribution of $f_i^*(x_i) - \hat f_i(x_i)$ by that of $\hat f_i(x_i) - \hat f_i^{(B)}(x_i)$.

Our goal is to take maximal advantage of the flexibility of the bootstrap to estimate asymmetric
intervals about the curve that reflect the potentially asymmetric nature of the distribution of
$f_i^*(x_i) - \hat f_i(x_i)$. This can be due to either asymmetric error distribution or biased curve estimates
(or both). In addition, we wish our interval estimates to reflect (probable) heteroscedasticity of the
errors. To this end we repeatedly draw bootstrap samples (of size $N$ with replacement) from the
data. For each such sample we perform the same modeling procedure as was applied to the original
data, thereby obtaining a set of curve estimates $\hat f_i^{(B)}(x_i)$, $1 \le i \le p$. At each (original data) value,
$x_i$, two averages are computed:

$$e_+^2(x_i) = E_B^{(+)}[\hat f_i(x_i) - \hat f_i^{(B)}(x_i)]^2 \qquad (28a)$$

$$e_-^2(x_i) = E_B^{(-)}[\hat f_i(x_i) - \hat f_i^{(B)}(x_i)]^2. \qquad (28b)$$

The first average (28a) is over those bootstrap replications for which $\hat f_i(x_i) - \hat f_i^{(B)}(x_i) > 0$, and the
second (28b) is over those for which $\hat f_i(x_i) - \hat f_i^{(B)}(x_i) < 0$. The individual averages so obtained
at each value of $x_i$, $e_\pm^2(x_i)$, are then smoothed against $x_i$ using a simple (constant span) running
average smoother. The resulting smoothed estimates $\tilde e_\pm^2(x_i)$ are then used to define confidence
intervals about the original data estimate $\hat f_i(x_i)$:

$$\hat f_i(x_i) \pm \sqrt{\tilde e_\pm^2(x_i)}. \qquad (29)$$
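
The interval construction can be sketched for a single predictor. This is an illustration under loud assumptions: a straight-line fit (`fit_predict`) stands in for the modeling procedure, the running average simply shrinks its window at the ends, and the span and number of replications are arbitrary choices. Only the use of separate averages for positive and negative differences, followed by smoothing, comes from the text.

```python
import numpy as np

def fit_predict(x_train, y_train, x_eval):
    """Stand-in curve estimator (a straight line); not the TURBO smoother."""
    slope, intercept = np.polyfit(x_train, y_train, 1)
    return intercept + slope * x_eval

def running_mean(values, span=5):
    """Constant-span running average with shrinking windows at the ends."""
    half = span // 2
    n = len(values)
    return np.array([values[max(0, i - half):min(n, i + half + 1)].mean()
                     for i in range(n)])

def bootstrap_bands(x, y, n_boot=200, span=5, seed=0):
    """Asymmetric bands f_hat +/- sqrt(smoothed e^2_{+/-}) as in (28)-(29)."""
    rng = np.random.default_rng(seed)
    order = np.argsort(x)
    x, y = x[order], y[order]
    n = len(x)
    f_hat = fit_predict(x, y, x)
    sq_pos, n_pos = np.zeros(n), np.zeros(n)
    sq_neg, n_neg = np.zeros(n), np.zeros(n)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)              # sample with replacement
        diff = f_hat - fit_predict(x[idx], y[idx], x)
        pos, neg = diff > 0, diff < 0
        sq_pos[pos] += diff[pos] ** 2
        n_pos[pos] += 1
        sq_neg[neg] += diff[neg] ** 2
        n_neg[neg] += 1
    e_pos = sq_pos / np.maximum(n_pos, 1)             # (28a); 0 if no such draws
    e_neg = sq_neg / np.maximum(n_neg, 1)             # (28b)
    upper = f_hat + np.sqrt(running_mean(e_pos, span))
    lower = f_hat - np.sqrt(running_mean(e_neg, span))
    return x, f_hat, lower, upper

rng = np.random.default_rng(3)
x = rng.uniform(size=40)
y = 2.0 * x + rng.normal(scale=0.3, size=40)
xs, f_hat, lower, upper = bootstrap_bands(x, y)
```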

In addition to assessing the variability of the individual predictor variable curve estimates
$\hat f_i(x_i)$, it is important to obtain a realistic estimate of the future prediction error, FPE, of the
entire additive model (2),

$$\mathrm{FPE} = E\Big[y - \sum_{i=1}^{p} \hat f_i(x_i)\Big]^2.$$

Here the expected value is over the population joint distribution of the response and predictor
variables. Sample reuse techniques such as bootstrapping (Efron, 1983) and cross-validation (Stone,
1974) provide a variety of such estimates. Of these, the so-called "632-bootstrap" has shown
superior performance in several simulation studies (Efron, 1983; Gong, 1982; Crawford, 1986).
This estimate is a convex combination of two different estimates

$$\mathrm{FPE}_{632} = 0.632\, \mathrm{FPE}_B + 0.368\, \mathrm{ASR}. \qquad (30)$$
The second, ASR, is the average squared residual corresponding to the original data fit. The first
estimate, $\mathrm{FPE}_B$, is obtained from bootstrap sampling. As a consequence of the random nature of
selecting observations for the bootstrap samples, a (different) subset of the observations will fail to
be selected to appear at all in a particular bootstrap sample. On average, $0.368 N$ data observations
will not contribute in this way to a bootstrap sample. Each time an observation does not so appear,
its prediction error (squared) is computed, based on the model estimated from the corresponding
bootstrap sample from which it is absent. The quantity $\mathrm{FPE}_B$ is the average of these prediction
errors over all such left-out observations throughout the entire sequence of bootstrap replications.
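
The estimate (30) can be sketched as follows. As before, a straight-line fit is a stand-in for the modeling procedure, and the number of replications and the simulated data are arbitrary choices for illustration; the function names are hypothetical.

```python
import numpy as np

def fit_predict(x_train, y_train, x_eval):
    """Stand-in model (a straight line); not the additive modeling procedure."""
    slope, intercept = np.polyfit(x_train, y_train, 1)
    return intercept + slope * x_eval

def fpe_632(x, y, n_boot=200, seed=0):
    """0.632 * (left-out bootstrap error) + 0.368 * ASR, as in (30)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    asr = np.mean((y - fit_predict(x, y, x)) ** 2)
    total, count = 0.0, 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        left_out = np.setdiff1d(np.arange(n), idx)    # observations not drawn
        if left_out.size == 0:
            continue
        pred = fit_predict(x[idx], y[idx], x[left_out])
        total += np.sum((y[left_out] - pred) ** 2)
        count += left_out.size
    fpe_b = total / count
    return 0.632 * fpe_b + 0.368 * asr, fpe_b, asr, count / (n_boot * n)

rng = np.random.default_rng(4)
x = rng.uniform(size=50)
y = 2.0 * x + rng.normal(scale=0.3, size=50)
fpe, fpe_b, asr, frac_left_out = fpe_632(x, y)
```

The last returned value is the observed fraction of observations left out of a bootstrap sample, which should be close to $(1 - 1/N)^N \approx 0.368$.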

The bootstrapping procedure outlined above simulates situations where the response and
predictors are both random variables sampled (independently) from some joint distribution. That is,
if another sample were to be selected, different values of the predictor variables as well as the
response would be realized. Therefore, the resulting confidence interval and FPE estimates are not
conditional on the design (realized set of predictor values). This is appropriate in most
observational settings. There are situations, however, where the design is presumed to be fixed. That is,
every replication of the experiment results in an identical set of values for the predictor variables
and only the responses are random. Bootstrapping (as outlined above) will tend to overestimate
both the confidence intervals and the FPE in fixed design situations (just as estimates conditioned
on the design underestimate them for observational settings). Therefore, if the design is fixed these
bootstrap estimates should be regarded as conservative.

5.0  Simulation  studies  and  data  examples 

In this section we compare the technique outlined in the previous sections (referred to for
identification as the "TURBO" smooth/model) to some other methods commonly used for smoothing and
additive modeling through a limited simulation study and application to data. The goal is to identify
those settings in which this procedure can be expected to provide good performance when compared
to existing methodology. For the smoothing problem ($p = 1$) we compare with smoothing splines
(Reinsch, 1967), a popular nonadaptive local averaging method, and a recently proposed adaptive
span smoother, "SUPER SMOOTHER" (Friedman, 1984). With smoothing splines the roughness
penalty was automatically chosen through generalized cross-validation (Craven and Wahba,
1979). For additive modeling we make comparisons with the projection selection/ACE approach
using SUPER SMOOTHER. In all examples, the knot placement increment is given by (11) with
$\alpha = 0.05$.
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5.1  Smoothing  pure  noise 

This is a simulation study to compare how well these three smoothers estimate a constant
function in the presence of homoscedastic noise. That is, how much structure do they estimate when
there is no underlying structure in the population? A set of response-predictor pairs $(x_i, y_i)$,
$1 \le i \le N$, were generated, with $0 \le x_i \le 1$ randomly sampled from a uniform distribution, and the
$y_i$ drawn from a standard normal distribution. Figures 1a, 1b, and 1c show a scatter plot of one
such sample ($N = 20$) with the corresponding TURBO, smoothing spline, and SUPER smooths,
respectively, superimposed. The TURBO curve estimate is seen to be a constant (no knots) equal
to the sample response mean. The smoothing spline and SUPER SMOOTHER estimates show a
gentle dependence on $x$.

Since one cannot discern expected performance based on one realization, we study average
performance over 100 such realizations, for each of $N = 20$ and $N = 40$. The results are shown
in Figures 1d and 1e respectively; for the larger sample size the errors are generally smaller, but
the qualitative comparisons are the same. In both cases the average absolute error is plotted as
a function of abscissa value. (For the TURBO smoother, the piecewise linear and cubic smooths
give almost identical results.) The TURBO smoother (solid line) is seen to give uniformly smaller
average error than the other methods, though of course this overall performance is mostly due
to the relative amount of smoothing chosen (automatically) by the method rather than to the
choice of method itself. Perhaps of more interest is the uniformity of the error across the range
of observations; for this problem in particular, TURBO seems not to exhibit the large error near
the ends of the interval ("end effects") associated with the other methods. The especially poor
performance of SUPER SMOOTHER (dashed line) in very high noise environments has been
noted before (Breiman and Friedman, 1985). It is also known, as most easily seen by considering
the "equivalent kernel" formulation discussed by Silverman (1984), that the smoothing spline will
have higher variance near the ends. Also, the smoothing spline can be affected by bias effects if the
true underlying curve does not satisfy appropriate boundary conditions (see Rice and Rosenblatt,
1983); Agarwal and Studden (1980) showed that these end bias effects are not felt if one uses
piecewise polynomial models with fixed knots, but since the underlying model is constant in this
case, the bias effects are not relevant. It is clear that further theoretical work will be required to
understand TURBO's apparent improvement in boundary behavior over other methods.
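
The Monte Carlo comparison described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' FORTRAN code: the constant fit mimics what TURBO returns when it places no knots, and a k-nearest-neighbour running mean (with an arbitrary k = 5) stands in for a local averaging smoother. Because the true function is zero, the absolute error at a point is simply the absolute fitted value.

```python
import random


def simulate_pure_noise(n, rng):
    """x uniform on [0, 1]; y standard normal and unrelated to x."""
    xs = [rng.random() for _ in range(n)]
    ys = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return xs, ys


def constant_fit(xs, ys):
    """Fit with no knots: the sample response mean everywhere."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def knn_mean_fit(xs, ys, k=5):
    """Stand-in local averaging smoother: mean of the k nearest responses."""
    def predict(x):
        nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x))[:k]
        return sum(ys[i] for i in nearest) / k
    return predict


def average_abs_error(make_fit, n, grid, reps=100, seed=0):
    """Average |fit(x) - 0| over replications, at each grid abscissa."""
    rng = random.Random(seed)
    totals = [0.0] * len(grid)
    for _ in range(reps):
        xs, ys = simulate_pure_noise(n, rng)
        fit = make_fit(xs, ys)
        for j, x in enumerate(grid):
            totals[j] += abs(fit(x))
    return [t / reps for t in totals]


grid = [j / 10 for j in range(11)]
err_constant = average_abs_error(constant_fit, 20, grid)
err_local = average_abs_error(knn_mean_fit, 20, grid)
```

Plotting `err_constant` and `err_local` against `grid` gives the same kind of display as the average absolute error curves described for Figures 1d and 1e, although the stand-in smoother will not reproduce the end effects of the methods actually compared.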


5.2  Smoothing  a  monotonic  function 

Our next example increases the complexity of the problem slightly. Here $N = 25$ response-predictor
pairs $(x_i, y_i)$ were generated according to the prescription

$$y_i = \exp(6 x_i) + \varepsilon_i \qquad (31)$$

with the $x_i$ randomly drawn from a uniform distribution in the interval $[0, 1]$ and the $\varepsilon_i$ drawn
from a (heteroscedastic) normal distribution

$$\varepsilon_i \sim N(0, [100(1 - x_i)]^2). \qquad (32)$$

In this example the curvature of the true underlying conditional expectation is increasing with
abscissa value and the noise is heteroscedastic with standard deviation decreasing with abscissa
value.
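
A data generator for this example might look like the sketch below. It assumes the model reads y = exp(6x) + noise, with the noise standard deviation equal to 100(1 - x), which is how equations (31) and (32) are read here; the function names are hypothetical.

```python
import math
import random


def true_mean(x):
    """Conditional expectation in (31)."""
    return math.exp(6.0 * x)


def noise_sd(x):
    """Noise standard deviation in (32): 100 at x = 0, falling to 0 at x = 1."""
    return 100.0 * (1.0 - x)


def simulate_monotone(n, rng):
    """Draw n (x, y) pairs with x uniform on [0, 1)."""
    xs = [rng.random() for _ in range(n)]
    ys = [true_mean(x) + rng.gauss(0.0, noise_sd(x)) for x in xs]
    return xs, ys


xs, ys = simulate_monotone(25, random.Random(1))
```

Near x = 0 the signal is about 1 while the noise standard deviation is about 100, whereas near x = 1 the signal is about 403 and the noise is negligible. This is what makes a single global amount of smoothing a poor compromise.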

Figure 2a shows a scatter plot of such a sample superimposed with both the piecewise linear
and piecewise cubic TURBO smooths and the true underlying conditional expectation, $\exp(6x)$.
Figures 2b and 2c show the corresponding smoothing spline and SUPER smooths. In this case,
the piecewise cubic TURBO estimate gives a slightly better fit than the piecewise linear to the
sample (as well as the true underlying curve). The smoothing spline estimate exhibits considerable
variability in the high noise region and the SUPER SMOOTHER somewhat less.

In order to study expected performance, 100 replications (25 observations each) were generated
according to (31), (32), and fit with the three smoothing methods: piecewise cubic TURBO model,
smoothing splines, and SUPER SMOOTHER. Figure 2d plots their average absolute error,
$|\hat{f}(x) - \exp(6x)|$, as a function of abscissa value, $x$. In the high noise region $x < 0.2$ both the smoothing
spline (dotted line) and SUPER SMOOTHER (dashed line) exhibit large error associated with
the high variance of their estimates. In the intermediate region $0.2 < x < 0.8$ both the TURBO
(solid line) and SUPER smoothers have comparable performance. In the low noise high curvature
extreme, $x > 0.8$, all three methods produce considerably increased error (bias) with the SUPER
SMOOTHER degrading the least. Over most of the region the (nonadaptable) smoothing spline
method gives relatively poor performance. This might be expected since both the curvature and
noise level are varying, thereby causing a single span value to be less appropriate.

5.3 A difficult smoothing problem

Our final smoothing example is intended to simulate the motorcycle impact data in Silverman
[...]
distribution in the interval $[-0.2, 1.0]$ and the $y_i$ given by

$$y_i = \begin{cases} \varepsilon_i, & x_i \le 0 \\ \sin[2\pi(1 - x_i)^2] + \varepsilon_i, & 0 < x_i \le 1 \end{cases}$$

with the $\varepsilon_i$ randomly generated from

$$\varepsilon_i \sim N[0, \max{}^2(0.05, x_i)].$$
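
The test function for this example can be sketched as below. The sketch assumes the conditional expectation is zero for x at or below 0 and sin[2π(1 - x)²] on (0, 1], with noise standard deviation max(0.05, x). This reading is an assumption, chosen because it agrees with the description of a derivative discontinuity at x = 0 and an interior minimum. Function names are hypothetical.

```python
import math
import random


def true_mean(x):
    """Flat at zero up to x = 0, then one full sine cycle in (1 - x)^2."""
    if x <= 0.0:
        return 0.0
    return math.sin(2.0 * math.pi * (1.0 - x) ** 2)


def noise_sd(x):
    """Constant 0.05 for x below 0.05, then increasing linearly with x."""
    return max(0.05, x)


def simulate_difficult(n, rng):
    """Draw n (x, y) pairs with x uniform on [-0.2, 1.0]."""
    xs = [rng.uniform(-0.2, 1.0) for _ in range(n)]
    ys = [true_mean(x) + rng.gauss(0.0, noise_sd(x)) for x in xs]
    return xs, ys


xs, ys = simulate_difficult(50, random.Random(2))
```

Under this reading the minimum of the curve falls where (1 - x)² = 3/4, that is near x = 0.13, inside the low noise region.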

The  second  derivative  of  the  underlying  conditional  expectation  changes  sign  four  times  and  is 
infinite at x = 0. The standard deviation of the additive noise is small and constant for x < 0.05,
and then increases linearly with x. Figure 3a shows a scatter plot of such a sample. Figure
3b superimposes the piecewise linear and cubic TURBO smooths along with the true underlying
conditional expectation. Figures 3c and 3d show respectively the corresponding smoothing spline
and  SUPER  SMOOTHER  smooths.  All  but  the  piecewise  linear  estimate  have  a  downward  bias 
at  the  derivative  discontinuity.  Both  TURBO  smooths  have  a  downward  bias  at  the  minimum, 
whereas  the  smoothing  spline  and  SUPER  smooths  have  an  upward  bias.  The  smoothing  spline 
estimate  exhibits  considerably  more  variation  in  the  higher  noise  regions.  The  piecewise  cubic 
TURBO  smooth  again  gives  a  slightly  better  fit  to  the  data  than  does  the  piecewise  linear. 

As in the previous examples, we compare expected performance of the three methods over 100
replications of 50 observations each. Figure 3e shows the average absolute error (from the true
underlying conditional expectation) for the piecewise cubic TURBO smooths, smoothing splines,
and SUPER SMOOTHER. In the higher noise regions ($x > 0.25$) the TURBO and SUPER
smoothers are seen to have comparable error, but in the lower noise high curvature region ($x < 0.25$)
the SUPER SMOOTHER exhibits about 20% higher accuracy. It has considerably less bias at
the derivative discontinuity and the minimum points. Smoothing splines exhibit relatively poorer
performance over almost the entire interval. Again, this might have been expected since this is
a highly heteroscedastic situation with varying curvature. Nonadaptable smoothers must choose
a compromise smoothing parameter for the entire region, whereas the adaptable procedures can
adjust the span to try to account for such effects.

5.4  Additive  modeling  with  pure  noise. 

Since it is as important for a method not to find predictive structure when it is absent as it is
to find it when present, we first study the performance of our additive modeling procedure when
there is no predictive relationship between the response and predictors. Two simulation experiments
were performed. In the first, 100 replications of a sample of size $N = 50$ were generated. The
responses were drawn from a standard normal distribution. There were $p = 10$ predictor variables,
each independently drawn from a uniform distribution in the interval $[0, 1]$. The TURBO modeling
procedure was applied to each of these 100 replicated samples. In 67 replications no knots were
placed on any of the ten predictors, and the estimated response function was taken as the sample
response mean. In 24 replications one knot was placed and in 9 cases two knots were used. Thus,
two thirds of the time the TURBO model reported no predictive relationship. In the rest of the
cases it reported a small one. Table 1 summarizes the distribution of both the sample multiple
correlation ($R^2$) between the response and the estimated model, and the root mean squared distance
$(\mathrm{ESE})^{1/2}$ of the estimated model from the truth, $f(x_1, \ldots, x_{10}) = 0$.

For comparison we also applied to these data sets the projection selection procedure (Friedman
and Stuetzle, 1981), or equivalently, the ACE procedure with the response transformation restricted
to be linear (Breiman and Friedman, 1985), using the SUPER SMOOTHER (Friedman, 1984).
The corresponding distributions of $R^2$ and $(\mathrm{ESE})^{1/2}$ are also summarized in Table 1. In contrast
to the TURBO model, this method is seen to seriously overfit the data as reflected in the high
values of both quantities. The propensity of ACE (based on the SUPER SMOOTHER) to overfit
in low signal to noise situations was discussed by Fowlkes and Kettenring (1985), and Breiman and
Friedman (1985).

A second simulation experiment was performed, using the same setting but increasing the
sample size of each replication to $N = 100$. The TURBO model placed no knots 63 times. The
frequencies of one through five knots were, respectively, 26, 6, 3, 1, 1. The corresponding
distributions for both methods are shown in Table 1. The increased sample size is seen to improve the
performance of both methods but the qualitative aspects of their comparison are the same as with
the smaller ($N = 50$) sample size. The TURBO modeling procedure is seen to be fairly
conservative. It should be noted that the tendency of the ACE method to drastically overfit in low signal
to noise small sample settings is not a fundamental property, but is mainly a consequence of its
implementation using the highly flexible SUPER SMOOTHER.
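
Why a flexible fit reports a sizeable resubstitution $R^2$ on pure noise can be illustrated with a deliberately simple stand-in. The sketch below is neither ACE nor SUPER SMOOTHER: it fits a single predictor by bin means, where the number of bins plays the role of flexibility. With no empty bins, the expected resubstitution $R^2$ on pure normal noise is (bins - 1)/(N - 1), even though the true relationship is null. Function names are hypothetical.

```python
import random


def r_squared_bin_means(xs, ys, bins):
    """Resubstitution R^2 of a fit that predicts the mean of each x-bin."""
    n = len(ys)
    grand = sum(ys) / n
    groups = {}
    for x, y in zip(xs, ys):
        # Equal-width bins on [0, 1]; x = 1.0 is placed in the last bin.
        groups.setdefault(min(int(x * bins), bins - 1), []).append(y)
    rss = 0.0
    for members in groups.values():
        m = sum(members) / len(members)
        rss += sum((y - m) ** 2 for y in members)
    tss = sum((y - grand) ** 2 for y in ys)
    return 1.0 - rss / tss


def mean_r_squared(n, bins, reps=200, seed=0):
    """Average resubstitution R^2 over replications of pure noise."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        xs = [rng.random() for _ in range(n)]
        ys = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += r_squared_bin_means(xs, ys, bins)
    return total / reps
```

For example, `mean_r_squared(50, 10)` should come out near 9/49, while a single bin (the analogue of placing no knots) gives zero.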

5.5  A  highly  structured  additive  model 

This example is intended to contrast with the previous one. As in the previous example there
are $p = 10$ predictor variables each independently generated from a uniform distribution on $[0, 1]$.
Two simulation experiments of 100 replications each were performed with $N = 50$ and $N = 100$.
The response variables were generated by

$$y_i = f^*(X_{i1}, \ldots, X_{i10}) + \varepsilon_i$$

with the $\varepsilon_i$ independently drawn from a standard normal distribution. The function $f^*$ was taken
to be

$$f^*(X_1, \ldots, X_{10}) = 0.1\, e^{4 X_1} + \frac{4}{1 + e^{-(X_2 - 0.5)/0.05}} + 3 X_3 + 2 X_4 + X_5.$$

In this case the signal to noise ratio (standard deviation of $f^*$) is 2.47. The true underlying
conditional expectation is additive in the ten predictor variables. The relationship is highly nonlinear in
the first two, linear with decreasing strength in the next three, and constant (zero) in the last five.
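
As a rough check on the quoted signal to noise ratio, the sketch below evaluates the standard deviation of the additive function under independent uniform predictors. It assumes the function has the form 0.1·exp(4·X1) + 4/(1 + exp(-(X2 - 0.5)/0.05)) + 3·X3 + 2·X4 + X5, which is how the display is read here. Because the predictors are independent, the variance of the sum is the sum of the component variances, and each is computed by midpoint quadrature. Function names are hypothetical.

```python
import math


def f_star(x):
    """x is a sequence of ten predictor values in [0, 1]; the last five are unused."""
    return (0.1 * math.exp(4.0 * x[0])
            + 4.0 / (1.0 + math.exp(-(x[1] - 0.5) / 0.05))
            + 3.0 * x[2] + 2.0 * x[3] + x[4])


def component_variances(m=20000):
    """Variance of each additive term for X uniform on [0, 1] (midpoint rule)."""
    terms = [
        lambda t: 0.1 * math.exp(4.0 * t),
        lambda t: 4.0 / (1.0 + math.exp(-(t - 0.5) / 0.05)),
        lambda t: 3.0 * t,
        lambda t: 2.0 * t,
        lambda t: t,
    ]
    grid = [(k + 0.5) / m for k in range(m)]
    variances = []
    for g in terms:
        vals = [g(t) for t in grid]
        mean = sum(vals) / m
        variances.append(sum((v - mean) ** 2 for v in vals) / m)
    return variances


signal_sd = math.sqrt(sum(component_variances()))
```

Under these assumptions `signal_sd` comes out at about 2.5, in the same range as the 2.47 quoted in the text. The source does not say how its figure was computed, so a small difference is not surprising. The sharp logistic term in X2 contributes the largest share of the variance.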

Figures 4a-4e show the piecewise linear and cubic curve estimates (24), (25) for the first five
variables in the first replication of $N = 50$. Also superimposed on the figures is the true underlying
function for the corresponding variable (solid line), and with the errors $\varepsilon_i$ added to it (dots). As can
be seen, the TURBO model placed one knot on $X_1$, two on $X_2$, and one each on variables $X_3$, $X_4$,
and $X_5$. No knots were placed on the last five predictor variables. Both the piecewise linear and
cubic models fit the data with $R^2$ values of 0.93. The root mean-squared error of the piecewise
linear model from the true $f^*(X_1, \ldots, X_{10})$ was 0.45, whereas for the corresponding piecewise cubic
it was 0.47.

More important than performance on a single sample is average performance over 100
independent replications of this situation. Table 2 summarizes the results for piecewise cubic fitting.
The results shown in Fig. 4 (based on the first replication of the 100) are seen to be somewhat
more favorable than those on the average. A second simulation experiment with 100 replications of
$N = 100$ observations each was also performed. These results are summarized in Table 2 as well.
The ACE/SUPER SMOOTHER procedure was applied to the same sets of replicated data with
the results also shown in Table 2.

Comparing the results, the TURBO modeling procedure is seen to exhibit substantially better
performance in terms of root mean squared error. The effect is, however, less dramatic than in
the pure noise case. On average, ACE/SUPER SMOOTHER fits the data sample 3.7 times more
closely than the TURBO model for $N = 50$. For $N = 100$ this factor is 1.3. This overfitting results
in an increased median modeling error of 16% for $N = 50$ and 50% for $N = 100$. On the other hand,
the TURBO model has a tendency to be conservative and underfit the data, producing estimates
that are sometimes overly smooth (too few knots). This has an interpretational advantage and a
predictive advantage when the curvature variation of the true underlying conditional expectation
is reasonably gentle. This example, however, simulates a situation in which that variation is fairly
dramatic and the advantage of the TURBO modeling procedure (in terms of expected squared error)
is thereby somewhat reduced.


5.6 Molecular quantitative structure-activity relationship.

We illustrate here TURBO modeling on a data set from organic chemistry (Wright and
Gambino, 1984). The observations are 36 compounds that were collected to examine the
structure-activity relationship of 6-anilinouracils as inhibitors of Bacillus subtilis DNA polymerase III. The
four structural variables measured on each compound are summarized in Table 3. The response
variable is the logarithm of the inverse concentration of 6-anilinouracil required to achieve 50%
inhibition of enzyme activity.

TURBO modeling applied to these data placed four knots: one on the first variable, two on
the second, and one on the third. The $e^2 = 1 - R^2$ for the piecewise linear fit was 0.12, while for the
piecewise cubic it was 0.11. The corresponding 632-bootstrap estimates (30) were 0.23 and 0.22.
Figures 5a-5d show the piecewise cubic curve estimates $\hat{f}_i(x_i)$, $i = 1, \ldots, 4$, along with the bootstrap
confidence intervals (29). The data points (dots) on the figures are the scaled residuals from the fit
added to the curve at each abscissa value (component plus residual plot). The scale factor is the
square root of the ratio of the 632 bootstrap estimate to the resubstitution $e^2$. The curve estimates
on the first three predictors are all seen to be fairly nonlinear, especially the second one.
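
The residual scaling used in the component plus residual plots amounts to a one-line calculation. The sketch below uses the piecewise cubic figures quoted in this section (resubstitution $e^2$ of 0.11 and 632-bootstrap estimate of 0.22); the function names are hypothetical.

```python
import math


def residual_scale_factor(e2_bootstrap, e2_resub):
    """Square root of the ratio of the bootstrap e^2 to the resubstitution e^2."""
    return math.sqrt(e2_bootstrap / e2_resub)


def component_plus_residual(curve_values, residuals, scale):
    """Points to plot: curve value plus scaled residual at each abscissa."""
    return [c + scale * r for c, r in zip(curve_values, residuals)]


scale = residual_scale_factor(0.22, 0.11)
```

With these inputs the factor is the square root of 2, so each residual is inflated by roughly 41% before plotting. The scatter about the curve then reflects the estimated future prediction error and not the optimistic resubstitution error.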

ACE/SUPER SMOOTHER was also applied to these data. The resubstitution $e^2$ was 0.054 while
the 632-bootstrap estimate was 0.29. As in the simulated data example (Section 5.5), ACE/SUPER
SMOOTHER is seen to fit the data more closely than the TURBO model, but the resulting overfit
leads to inferior future prediction error in this case.

5.7  Air  pollution  data. 

This data set consists of daily measurements of ozone concentration and eight meteorological
variables for 330 days of 1976 in the Los Angeles basin. Table 4 describes the variables. These
data were introduced by Breiman and Friedman (1985) to illustrate the ACE procedure. They
were also analyzed by Hastie and Tibshirani (1984) using their Generalized Additive Modeling
method (see also Hastie and Tibshirani, 1986). In contrast to previous examples this is a large
($N = 330$), complex, and not very noisy data set. One might therefore expect that the simple
TURBO modeling procedure would be at a disadvantage when compared to the more sophisticated
approaches that have been applied to these data.

Applying the TURBO model resulted in ten knots being placed: one each on variables 1, 4, 5,
and 6, and two each on variables 3, 8, and 9. The resulting resubstitution $e^2$ was 0.20 for both the
piecewise linear and cubic fits. The corresponding 632-bootstrap estimates (20 replications) were
0.21 for both. The piecewise cubic individual variable curve estimates, $\hat{f}_i(x_i)$, $1 \le i \le 9$, (25) are
shown in Figs. 6a-6i, along with their bootstrap confidence intervals (29) and (scaled) residuals.

Exact comparison with the ACE results in Breiman and Friedman (1985) is not possible since
they applied ACE in a mode that estimates an optimal (minimum $e^2$) response transformation as
well. The resulting response estimate was, however, not too far from the identity function so that a
rough comparison is possible. They applied a variable based forward stepwise procedure, selecting
five variables. Their resubstitution $e^2$ for the optimal response function was 0.18. The variables
that were selected and the corresponding curves are fairly consistent with (but not identical to) the
TURBO model results. Generally, the TURBO curves are a bit simpler than the corresponding
ACE/SUPER SMOOTHER estimates. Since bootstrapping or cross-validating the forward stepwise
ACE procedure would be prohibitively expensive, no estimate of (honest) future prediction error
could be given.

Hastie and Tibshirani (1984) also analyzed these data. Their Generalized Additive Modeling
procedure as applied in this setting is equivalent to the ACE method with the response function
constrained to linearity. Therefore we can make direct comparison with their results. Hastie and
Tibshirani did not employ SUPER SMOOTHER, but rather a nonadaptable local linear smoother
with constant span. With all nine predictors in the regression function they obtained an $e^2$ of 0.20.
With the same subset of variables as used by Breiman and Friedman (1985) the $e^2$ was 0.22. Hastie
and Tibshirani (1986) provide a method of estimating the equivalent degrees-of-freedom used by
their fitting process. This estimate accounts for the flexibility associated with the resulting smooths
but does not account for the (nonlinear) span selection and variable subset selection process. They
report 21.8 degrees-of-freedom for their fit with all variables and 12.4 for the five variable subset.
The corresponding degrees-of-freedom count for the TURBO fit would be 11 (constant term plus
coefficients for ten knots).

6.0  Discussion 

The examples of Section 5 indicate that the smoothing method outlined in Section 2, and
the corresponding additive modeling procedure described in Section 3, are competitive with the
techniques to which they were compared. They seem to have substantial advantage in situations
with low sample size and high noise, where the underlying functions are fairly simple. In this context
a simple function is one that can be reasonably well approximated by a piecewise linear function
with few (judiciously placed) knots. This was the case in the examples of Sections 5.1, 5.2, 5.4, 5.5,
and 5.6. Our procedures appeared to have similar performance to the corresponding competitors
in large sample low noise situations, again with fairly simple underlying functions (Section 5.7).
The example in Section 5.3 represented a moderate sample size situation with both high and low
noise regions (strong heteroscedasticity) and a complex underlying function. In this particular case
SUPER SMOOTHER appeared to perform somewhat but not dramatically better.

FORTRAN  programs  implementing  the  procedures  herein  described  are  available  from  the 
authors. 
Table 1

Comparison of TURBO and ACE additive modeling of pure noise (Section 5.4). The 5, 50,
and 95 percent points are given for the distribution of the multiple correlation $R^2$ (resubstitution),
and the root expected squared error $(\mathrm{ESE})^{1/2}$.

                      $R^2$                 $(\mathrm{ESE})^{1/2}$
               .05    .5     .95        .05    .5     .95

N = 50
   TURBO       0.0    0.0    0.21       0.02   0.15   0.50
   ACE         0.74   0.91   0.97       0.68   0.85   1.00

N = 100
   TURBO       0.0    0.0    0.12       0.008  0.12   0.41
   ACE         0.49   0.70   0.56       0.55   0.69   0.59

Table 2

Comparison of TURBO and ACE additive modeling in a higher signal to noise situation
(Section 5.5). The 5, 50, and 95 percent points are given for the distribution of the multiple
correlation $R^2$ (resubstitution), and the root expected squared error $(\mathrm{ESE})^{1/2}$.

                      $R^2$                 $(\mathrm{ESE})^{1/2}$
               .05    .5     .95        .05    .5     .95

N = 50
   TURBO       0.79   0.86   0.93       0.34   0.75   0.99
   ACE         0.97   0.99   1.0        0.68   0.87   1.00

N = 100
   TURBO       0.84   0.87   0.91       0.31   0.48   0.62
   ACE         0.93   0.96   0.99       0.60   0.72   0.85


Table 3

Variables associated with molecular quantitative structure-activity data example (Section 5.6).

$X_1$ - meta substituent hydrophobic constant
$X_2$ - para substituent hydrophobic constant
$X_3$ - group size of substituent in meta position
$X_4$ - group size of substituent in para position
$Y$   - logarithm of the inverse concentration of
        6-anilinouracil required to achieve 50%
        inhibition of the enzyme.


Table 4

Variables associated with the air pollution data example (Section 5.7).

$X_1$ - Vandenburg 500 millibar height
$X_2$ - humidity
$X_3$ - inversion base temperature
$X_4$ - Sandburg Air Force Base temperature
$X_5$ - inversion base height
$X_6$ - Daggot pressure gradient
$X_7$ - wind speed
$X_8$ - visibility
$X_9$ - day of the year
$Y$   - Upland ozone concentration
A smoothed EM algorithm for the solution of
Wicksell’s  corpuscle  problem. 


by 


J.  D.  Wilson 
respectively. The improvement between figures 4 and 3 is dramatic. It is clear that
the use of the EM algorithm without smoothing not only takes a considerable
number of iterations but tends to a solution which is too "wiggly" and erratic to
be realistic. Figure 1, obtained after only a fraction of the former number of
iterations, is much more plausible and more accurate. Figure 5 could be said to be [...]

where we take $\alpha = 12.0$ and $\beta = 0.1$. As before, $f_t$, the truncated distribution, is
obtained by setting $t$ to be 0.04 mm and $R$ to be 0.4 mm. The Weibull was chosen
as it is a distribution whose mass is concentrated near the point of truncation, the
point where the solution is most numerically unstable. Once again 10 different [...]



conjunction with simple smoothing. Simulation studies strongly suggest that,
unlike the EM algorithm, the new procedure converges quickly to a unique
solution independent of the starting value. The amount of computer processing [...]


Appendix  7 


An  evaluation  of  the  ICM  algorithm  for  image  reconstruction. 


by

R.  H.  GLENDINNING 

School of Mathematical Sciences, University of Bath, Bath BA2 7AY, U.K.

We  examine  the  properties  of  Iterated  Conditional  Modes  (ICM)  estimation  for 
a  number  of  synthetic  binary  images  using  simulation. 

KEY WORDS: Ill-posed problem; image reconstruction; ICM; simulated
annealing; smoothing parameter; neighbourhood system; Monte Carlo.

1. INTRODUCTION

In the last few years considerable interest has been shown in the problems
posed by the analysis of images corrupted by random noise. The reconstruction
of such images leads to special difficulties as it is an ill-posed problem (in
the sense described by O'Sullivan, 1986). Typically the reconstruction of an
array of pixels will have as many parameters as observations. A number of
techniques have been proposed which solve ill-posed problems by restricting
the class of admissible solutions; see Marroquin, Mitter & Poggio (1987). This
is achieved by introducing a priori knowledge about admissible solutions.

Much interest currently centres on techniques which incorporate knowledge
about the underlying image using Bayesian methodology; see Geman & Geman
(1984) and Kashyap & Lapsa (1984). These techniques assume that the
underlying scene can be adequately described as a realisation from a
prescribed Markov random field. Motivated by this approach, Besag (1986)
introduced a technique known as ITERATED CONDITIONAL MODES (ICM). This
iterative procedure incorporates knowledge about the underlying scene by the
choice of a ‘neighbourhood system’, weight function and smoothing parameter.
Broadly speaking, this method exploits the tendency of adjacent pixels to have
the same colour. A similar approach based on spatial autoregression is
described in Woods, Dravida & Mediavilla (1987).

In this paper we use simulation to evaluate the performance of ICM in
reconstructing binary (black-white) images. The reconstruction of binary
images is of considerable practical importance as many problems in object
recognition and manipulation fall into this category. For simplicity we
suppose that the underlying scene can be partitioned into an array of pixels
(picture elements) which are uniquely coloured black or white. At each pixel
we observe a signal which depends on its colour. We consider the case where
each signal is additively corrupted by independent normally distributed noise.
These are highly unrealistic assumptions as they ignore the problems
associated with mixed pixels, signal spread etc. However, we believe that the
study of ICM in this simplified setting will give valuable insight into its
behaviour in more complex situations.

In section 2 we describe the basic ICM algorithm and recall some basic
facts about Markov random fields. The synthetic scenes used in this study are
described in section 3. In section 4 we examine the influence of the
neighbourhood system and weight function on the quality of our
reconstructions. The choice of smoothing parameter is discussed in section 5.
We are particularly interested in identifying properties of the underlying
scene which influence the value given to $\beta$ (the smoothing parameter).
Some distributional properties of ICM reconstructions are discussed in
section 6. The numerical performance of the basic ICM algorithm is discussed
in section 7. We describe several modifications of the basic algorithm which
enhance its efficiency. Our findings are summarised in section 8.

The problem of restoring corrupted images has a long history in the image
processing literature, where a number of techniques of varying sophistication
have been suggested; see Bovik, Huang & Munson (1987) or Rosenfeld & Kak
(1982). A comparison of ICM with the multitude of competing techniques is
not attempted in this paper.


2.  THE  ICM  ALGORITHM  AND  MARKOV  RANDOM  FIELDS 

Let W be a rectangular window in the plane which is partitioned into an
(m x n) array of rectangular pixels of equal size. We assume that each pixel
can be uniquely coloured. The available colours are labelled (1,2,...,c). In
this paper we restrict attention to scenes with two colours which we call
black and white. The colour of the (i,j)th pixel is denoted by $x_{ij}$. We
refer to $(x_{ij})$ as the true or underlying scene. Suppose we observe an
array of signals $(y_{ij})$ generated by

    $y_{ij} = \mu(x_{ij}) + \varepsilon_{ij}$,    (2.1)

where $(\varepsilon_{ij})$ are independent and identically distributed random
variables and $\mu(.)$ is a function of $x_{ij}$ only. The object of image
analysis is to estimate the true or underlying scene $(x_{ij})$ from
$(y_{ij})$. In this paper we consider real-valued signals only. Models of this
form are not canonical in the study of corrupted images and the reader is
referred to Besag (1986) for a discussion of alternative models.

At first sight the natural way of estimating $(x_{ij})$ is by maximum
likelihood. In this approach we find $(x_{ij})$ which maximises

    $l((y_{ij}) \mid (x_{ij})) = \prod_{i=1}^{m} \prod_{j=1}^{n} f(y_{ij} \mid x_{ij})$,    (2.2)

where $f(y_{ij} \mid x_{ij})$ is the fully specified density function of
$y_{ij}$ conditional on $x_{ij}$. The estimates produced by this approach are
usually unsatisfactory as (2.1) has as many parameters $(x_{ij})$ as
observations. To improve the situation Geman & Geman (1984) and Besag (1986)
introduce information about $(x_{ij})$ into the estimating procedure. This is
achieved by regarding $(x_{ij})$ as a realisation from a Markov random field
(MRF). A detailed account of the salient features of MRF’s can be found in
Geman & Geman (1984), Besag (1974, 1986) or Suomela (1976). We briefly outline
the main properties of MRF’s relevant to the discussions in this paper.

For each pixel (i,j) we associate a set of pixels $F(i,j)$, not including
(i,j), called the neighbourhood of (i,j). The collection of sets $(F(i,j))$ is
called a neighbourhood system and satisfies the condition

    $(p,q) \in F(i,j)$ if and only if $(i,j) \in F(p,q)$.

Then $(x_{ij})$ is a MRF if

    (1) $P(x_{ij} \mid (x_{pq},\ (p,q) \neq (i,j))) = P(x_{ij} \mid x_{pq},\ (p,q) \in F(i,j))$,

    (2) $P((x_{ij})) > 0$,

where $P((x_{ij}))$ is the probability associated with the realisation
$(x_{ij})$. Conditions 1 and 2 impose severe restrictions on $P(.)$. Valid
forms of $P(.)$ are given by the Hammersley-Clifford Theorem; see Besag (1974)
or Suomela (1976).

We follow Geman & Geman (1984) and adopt a Bayesian approach where
we estimate $(x_{ij})$ from its posterior distribution, which is proportional
to

    $l((y_{ij}) \mid (x_{ij}))\, P((x_{ij}))$.    (2.3)

A plausible estimate of $(x_{ij})$ is the value of $(x_{ij})$ which maximises
(2.3). This is the MAP estimate of $(x_{ij})$. Geman & Geman (1984) use
simulated annealing to maximise (2.3). Van Laarhoven & Aarts (1987) give a
comprehensive description of simulated annealing and its application to image
analysis. Note that Greig, Porteous and Seheult, in the discussion of Besag
(1986), show that the MAP estimate of a binary scene can be calculated
exactly. It is not known whether the MAP estimator has any desirable
properties in this context.

Besag (1986) introduces an alternative estimator of $(x_{ij})$ known as
ITERATED CONDITIONAL MODES (ICM). This algorithm converges to a local maximum
of (2.3). Let $(\hat{x}_{ij})$ be the current estimate of $(x_{ij})$. For each
pixel we find the value of $x_{ij}$ which maximises

    $f(y_{ij} \mid x_{ij})\, P(x_{ij} \mid (\hat{x}_{pq}))$,    (2.4)

where $P(x_{ij} \mid (\hat{x}_{pq}))$ depends on the neighbours of (i,j) only.
Consider an example. Let $(x_{ij})$ be a binary scene and
$(\varepsilon_{ij})$ an array of independent normally distributed random
variables with zero mean and variance $\sigma^2$. We represent our knowledge
of $(x_{ij})$ by a MRF with neighbourhood system
$F(i,j) = ((i-1,j),\ (i+1,j),\ (i,j+1))$ and conditional probabilities

    $P(x_{ij} = k \mid (x_{pq},\ (p,q) \neq (i,j))) = \frac{\exp(\beta u_{ij}(k))}{\exp(\beta u_{ij}(0)) + \exp(\beta u_{ij}(1))}$,  $k = 0, 1$,    (2.5)

where the weight function $u_{ij}(k)$ is the number of neighbours of $x_{ij}$
with colour k. The value of $x_{ij}$ which maximises (2.4) minimises

    $(2\sigma^2)^{-1} (y_{ij} - \mu(x_{ij}))^2 - \beta \hat{u}_{ij}(x_{ij})$,    (2.6)

where $\hat{u}_{ij}(x_{ij})$ is the number of neighbours of (i,j) which have
colour $x_{ij}$ under the current estimate $(\hat{x}_{ij})$. We call $\beta$
the smoothing parameter. The extension of (2.6) to non-Gaussian noise is
immediate.
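The update based on (2.6) can be sketched in a few lines of Python. This is a minimal illustration and not the author's implementation: it assumes colours coded 0 (white) and 1 (black) with means 0 and 1, unit weights on the eight surrounding pixels, a raster-scan visiting order, and that pixels outside the window are simply ignored. The function names `icm_sweep` and `icm` are hypothetical.

```python
# Offsets of the eight surrounding pixels (unit weights assumed).
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
              (0, 1), (1, -1), (1, 0), (1, 1)]


def icm_sweep(y, x_hat, beta, sigma2, mu=(0.0, 1.0)):
    """One raster-scan sweep; updates x_hat in place, returns the number of changes."""
    m, n = len(y), len(y[0])
    changes = 0
    for i in range(m):
        for j in range(n):
            best_k, best_cost = x_hat[i][j], None
            for k in (0, 1):
                # Number of neighbours currently coloured k (window edge ignored).
                u = sum(1 for di, dj in NEIGHBOURS
                        if 0 <= i + di < m and 0 <= j + dj < n
                        and x_hat[i + di][j + dj] == k)
                # Criterion (2.6): squared-error term minus the smoothing reward.
                cost = (y[i][j] - mu[k]) ** 2 / (2.0 * sigma2) - beta * u
                if best_cost is None or cost < best_cost:
                    best_k, best_cost = k, cost
            if best_k != x_hat[i][j]:
                x_hat[i][j] = best_k
                changes += 1
    return changes


def icm(y, beta, sigma2, max_iter=12):
    """Start from the thresholded signal and sweep until nothing changes."""
    x_hat = [[1 if v > 0.5 else 0 for v in row] for row in y]
    for _ in range(max_iter):
        if icm_sweep(y, x_hat, beta, sigma2) == 0:
            break
    return x_hat
```

For example, a white 5 x 5 scene whose centre signal is 0.6 starts with one black pixel after thresholding, and `icm(y, beta=0.5, sigma2=0.5)` recolours it white because all eight neighbours disagree with it.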

Notice that (2.6) is in the form of a penalised likelihood and may be
interpreted in this way without recourse to Bayesian arguments. Note that ICM
and MAP are not equivalent for most scenes. Typically smaller values of
$\beta$ (relative to ICM) are required for MAP; see Greig, Porteous and
Seheult, in the discussion of Besag (1986). The relationship between
techniques like ICM and other regularisation procedures is discussed in
Titterington (1985).


3.  DESCRIPTION  OF  THE  SIMULATION  STUDY 


Seven scenes of varying complexity were constructed by partitioning the
unit square into $10^4$ square pixels of equal size. The colour of each pixel
was assigned to the colour of its mid-point. In this study we use black and
white scenes only.

To identify properties of ICM more easily we restrict attention to simple
synthetic scenes which cover a small alphabet of forms rather than use
naturally occurring images. Five simple geometric scenes are displayed in
figures 1 to 5. The remaining scenes, MRF2 and MRF3 (figures 6 and 7), are
realisations from a Markov random field with prescribed number of black pixels
(approx 50%). MRF2 and MRF3 were constructed using the algorithm described in
Cross & Jain (1983). Notice that we are sampling from the conditional
distribution of the prescribed MRF. We believe that realisations constructed
in this way capture much of the local structure of the unconditional model. In
the next section we describe three Markov random fields (Models I, II and III)
which are commonly used in this context. MRF2 is drawn from Model II with
$\beta = 0.5$ and MRF3 from Model III with $\beta = 0.75$.

We construct an array of signals $(y_{ij})$ using (2.1) with
$\mu(\mathrm{black}) = 1$, $\mu(\mathrm{white}) = 0$ and $(\varepsilon_{ij})$
an array of independent normally distributed random variables with zero mean
and variance $\sigma^2$. The maximum likelihood reconstruction is calculated
and used as the initial state for the ICM algorithm. This iterative procedure
is terminated after twelve iterations. Typically convergence occurs after six
iterations. This process is repeated fifteen times for each combination of
parameter and underlying model. The efficiency of this algorithm is discussed
in section 7.
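The signal construction and the maximum likelihood starting point can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the study: black is coded 1 and white 0, the small square scene and the seed are hypothetical, and the ML rule is written as a threshold at 1/2, which is what maximising a Gaussian likelihood with common variance and means 0 and 1 reduces to.

```python
import math
import random


def simulate_signals(scene, sigma2, rng):
    """y_ij = mu(x_ij) + e_ij with mu(black)=1, mu(white)=0 and N(0, sigma2) noise."""
    sd = math.sqrt(sigma2)
    return [[float(x) + rng.gauss(0.0, sd) for x in row] for row in scene]


def ml_reconstruction(y):
    """Pixelwise ML estimate: choose the colour whose mean is nearer the signal."""
    return [[1 if v > 0.5 else 0 for v in row] for row in y]


def ml_error_probability(sigma2):
    """P(N(0, sigma2) > 1/2): chance that a single pixel is misclassified by ML."""
    return 0.5 * math.erfc(0.5 / math.sqrt(2.0 * sigma2))


rng = random.Random(0)  # hypothetical seed, for repeatability only
# Hypothetical 10 x 10 scene: a black 4 x 4 square on a white background.
scene = [[1 if 3 <= i < 7 and 3 <= j < 7 else 0 for j in range(10)]
         for i in range(10)]
y = simulate_signals(scene, 0.5, rng)
x_ml = ml_reconstruction(y)
```

Under these assumptions `ml_error_probability(0.5)` is about 0.24, so roughly a quarter of the pixels in the starting reconstruction are wrong when $\sigma^2 = 0.5$.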

Many criteria can be used to evaluate reconstructions. Essentially the choice
depends on the image characteristics of greatest interest. In this paper we
use the number of misclassified pixels as an appropriate measure. The
suitability of this criterion has been the subject of much recent debate; see
the discussion of Besag (1986). We point out the limitations of this criterion
where appropriate.
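The misclassification criterion, together with its split into black and white pixels, can be computed as in the sketch below. The coding (1 for black, 0 for white) and the function name `misclassification_rates` are assumptions made for illustration.

```python
def misclassification_rates(truth, estimate):
    """Percentage of misclassified pixels overall and by true colour."""
    counts = {0: 0, 1: 0}   # number of pixels of each true colour
    errors = {0: 0, 1: 0}   # number of those that were misclassified
    for t_row, e_row in zip(truth, estimate):
        for t, e in zip(t_row, e_row):
            counts[t] += 1
            if t != e:
                errors[t] += 1

    def pct(a, b):
        return 100.0 * a / b if b else 0.0

    return {
        "all": pct(errors[0] + errors[1], counts[0] + counts[1]),
        "black": pct(errors[1], counts[1]),   # black pixels classified white
        "white": pct(errors[0], counts[0]),   # white pixels classified black
    }
```

With `truth = [[1, 1], [0, 0]]` and `estimate = [[1, 0], [0, 0]]` one of four pixels is wrong, and it is a black pixel, so the overall rate is 25% and the black rate is 50%.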


4.  THE  CHOICE  OF  MODEL. 

In this section we examine the effect of choosing three different weight
functions in (2.6). The choice of $\beta$ is discussed in section 5. In a
Bayesian framework we are modelling our knowledge of the uncorrupted scene by
a MRF with prescribed structure. Cross & Jain (1983) show that simple MRF’s
can generate a wide variety of binary scenes. The problem of choosing suitable
MRF’s to model specific scenes is not well understood; see Kashyap &
Chellappa (1983), Enting & Welberry (1978) and Pickard (1987). The last two
authors discuss parameter estimation for Markov random fields. An additional
complication arises when our knowledge about the underlying scene is imprecise
or difficult to model by a MRF. The success of this approach rests on the
assumption that only certain modest properties of our ‘prior’ are important.
Some tentative observations on the robustness of ICM reconstruction to model
specification are given in sections 4 and 5.

Figs 1-7
here

In this section we use three different MRF’s to describe our knowledge
about the scenes presented in figs 1 to 7. We examine the misclassification
rate achieved by ICM using each model and several values of the parameter
$\beta$. The models used are as follows:


MODEL I: A first order neighbourhood.

    $F(i,j) = ((i-1,j),\ (i+1,j),\ (i,j+1),\ (i,j-1))$,

    $P(x_{ij} = k \mid F(i,j)) \propto \exp(\beta u_{ij}(k))$,    (4.1)

where $u_{ij}(k)$ is the sum over $(p,q)$ of the weights

    $u_{pq}(k) = 1$ when $(p,q) \in F(i,j)$ and $x_{pq} = k$,    (4.2)

and zero otherwise.


MODEL II: A second order neighbourhood.

    $F(i,j) = ((i-1,j+1),\ (i+1,j+1),\ (i-1,j-1),\ (i+1,j-1),$
              $(i-1,j),\ (i,j+1),\ (i+1,j),\ (i,j-1))$.

$P(x_{ij} = k \mid F(i,j))$ is given by (4.1) and (4.2).


MODEL III: As for II with down weighted diagonals. $F(i,j)$ as for the
previous model and $P(x_{ij} = k \mid F(i,j))$ given by (4.1) with

    $u_{pq}(k) = 1$,  $(p,q) \in ((i+1,j),\ (i-1,j),\ (i,j+1),\ (i,j-1))$ and $x_{pq} = k$,

    $u_{pq}(k) = 2^{-1/2}$,  $(p,q) \in ((i-1,j+1),\ (i+1,j-1),\ (i+1,j+1),\ (i-1,j-1))$ and $x_{pq} = k$,

    $u_{pq}(k) = 0$ otherwise.    (4.3)

There are conflicting opinions as to whether models should be modified for
pixels adjacent to the window; see Ripley (1984). In this study we use the
unmodified models I, II and III. The effects of modification appear small
relative to the standard errors encountered in this study. Cross & Jain (1983)
show that models like I, II and III can be used to construct a wide variety of
binary scenes.
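The three weight functions can be written as small lookup tables, as in the sketch below. It assumes that the diagonal weight in Model III is $2^{-1/2}$, which is the value consistent with the factor 1.17 quoted later in this section, and it uses the unmodified models, so neighbours falling outside the window contribute nothing. The names `model_weights` and `weighted_count` are hypothetical.

```python
import math

ORTHOGONAL = [(-1, 0), (1, 0), (0, 1), (0, -1)]
DIAGONAL = [(-1, 1), (1, 1), (-1, -1), (1, -1)]


def model_weights(model):
    """Map each neighbour offset to its weight under Model I, II or III."""
    if model == "I":
        return {o: 1.0 for o in ORTHOGONAL}
    if model == "II":
        return {o: 1.0 for o in ORTHOGONAL + DIAGONAL}
    if model == "III":
        weights = {o: 1.0 for o in ORTHOGONAL}
        # Assumed down-weighting of the diagonal neighbours.
        weights.update({d: 1.0 / math.sqrt(2.0) for d in DIAGONAL})
        return weights
    raise ValueError("model must be 'I', 'II' or 'III'")


def weighted_count(x, i, j, k, weights):
    """u_ij(k): total weight of the neighbours of (i, j) that have colour k."""
    m, n = len(x), len(x[0])
    return sum(w for (di, dj), w in weights.items()
               if 0 <= i + di < m and 0 <= j + dj < n and x[i + di][j + dj] == k)
```

For an interior pixel whose neighbours all share colour k, the count is 4 under Model I, 8 under Model II and about 6.83 under Model III.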


TABLE I


Comparison of models I, II and III
Smallest average percentage of misclassified pixels
$\beta$ taking values in (0.25, 0.5, 0.75, 1.0, 1.25, 1.5) for Models II and III
$\beta$ taking values in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0) for Model I
The standard error of this estimate is given in brackets

Each scene described in figs 1 to 7 is reconstructed using models I, II and
III with various values of $\sigma^2$ and $\beta$. For models II and III we
find the value of $\beta$ in the set (0.25, 0.5, 0.75, 1.0, 1.25, 1.50) which
gives the smallest average misclassification rate. For model I we consider
values of $\beta$ in the set (0.5, 1.0, 1.5, 2.0, 2.5, 3.0). We choose
different values of $\beta$ for model I as there is strong empirical evidence
that the ‘optimal’ value of $\beta$ lies in this range for the scenes
considered. In Table I we display the smallest average misclassification rate
for $\sigma^2$ = 0.5 and 1.0. Similar results were obtained using different
values of $\sigma^2$. Notice that ICM is superior to the ML estimate for all
scenes. It is readily apparent that model I is vastly inferior to II and III
for all scenes considered. Model III is marginally superior to model II in the
majority of the scenes. In their study of edge penalties Brown and Silverman
(1987) present an argument which supports the use of model III in preference
to Model II for the majority of scenes. Recall that MRF2 and MRF3 are
realisations from a Markov random field with a fixed number of black pixels.
Using the ‘correct’ neighbourhood system appears to have little effect on the
quality of the reconstruction.

As the ‘optimal’ $\beta$ will usually be unknown we examine the average
misclassification rates for models II and III for several values of $\beta$.
The average percentage of misclassified pixels is presented in Tables II to
VII for various values of $\beta$.

In Tables II and III we display the average percentage of misclassified
pixels using models II and III for various values of $\beta$ and
$\sigma^2 = 0.5$. Similar results were obtained for other values of
$\sigma^2$. There is strong evidence to suggest that the ‘optimal’ value of
$\beta$ using model III is larger than the corresponding value for model II.
In figure 15 we compare the average percentage of misclassified pixels when
MRF3 is reconstructed using models II and III ($\sigma^2 = 0.5$). We plot the
average percentage of misclassified pixels using model II against $\beta$. For
Model III we plot the corresponding percentage against $(1/1.17)\beta$. From
this figure we see that a useful first approximation is to multiply the value
of $\beta$ used with model II by 1.17 when using model III. This ensures that
the second term in (2.6) has the same value for both models when
$\hat{u}_{ij}(x_{ij}) = 8$.
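The factor 1.17 can be checked with a short calculation. The sketch assumes that Model III gives the four diagonal neighbours weight $2^{-1/2}$ and the four orthogonal neighbours weight 1, so that a pixel whose eight neighbours all share its colour has a weighted count of $4 + 4/\sqrt{2}$ rather than 8.

```python
import math

# Weighted neighbour count when all eight neighbours share the pixel's colour.
full_count_model_2 = 8 * 1.0
full_count_model_3 = 4 * 1.0 + 4 * (1.0 / math.sqrt(2.0))  # assumed diagonal weight

# Multiplier for beta under Model III that equalises the penalty term in (2.6).
scale = full_count_model_2 / full_count_model_3
print(round(scale, 2))
```

Under this assumption the ratio is 8 / 6.83, which rounds to the quoted 1.17.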


TABLE II

Average percentage of misclassified pixels using Model II
Standard errors in brackets
Optimal reconstruction is bold faced

$\sigma^2 = 0.5$

$\beta$   BCIR     CROSS    TWO      MANY     VMANY    MRF2     MRF3


TABLE III

Average percentage of misclassified pixels using Model III
Standard errors in brackets. Optimal reconstruction is bold faced

$\sigma^2 = 0.5$

$\beta$   BCIR     CROSS    TWO      MANY     VMANY    MRF2     MRF3

0.75      0.60     1.01     1.08     2.38     7.37     8.01     4.98
         (0.04)   (0.07)   (0.05)   (0.09)   (0.15)   (0.11)   (0.10)

1.00      0.64     0.98     0.97     121      8.26     8.63     5.20
         (0.04)   (0.06)   (0.06)   (0.08)   (0.19)   (0.08)   (0.09)

1.25      7.11     1.08     1.25     2.81     9.76     9.25     5.72
         (0.04)   (0.06)   (0.09)   (0.08)   (0.25)   (0.11)   (0.09)

1.50      6.87     1.08     1.44     3.13     11.24    9.72     6.22
         (0.04)   (0.08)   (0.09)   (0.11)   (0.20)   (0.12)   (0.11)

In Tables IV to VII we present the analogous results for black and white
pixels. These results are similar to those in Tables II and III. Notice that
the ‘optimal’ value of $\beta$ is larger for white pixels than for black in
the majority of scenes. This may be due to the higher proportion of boundary
pixels for black features in most scenes (see Table IX).


TABLE IV

Average percentage of black pixels classified white using model II
Standard errors in brackets.
Optimal reconstruction is bold faced

$\sigma^2 = 0.5$

$\beta$   BCIR     CROSS    TWO      MANY     VMANY    MRF2     MRF3

0.25      4.43     7.72     --       11.91    16.16    9.51     7.70
         (0.15)   (0.34)   (0.28)   (0.39)   (0.33)   (0.20)   (0.18)

0.50      0.77     4.30     4.87     12.13    18.46    7.87     4.98
         (0.06)   (0.41)   (0.21)   (0.44)   (0.48)   (0.16)   (0.12)

0.75      0.42     5.36     5.33     14.80    24.96    8.11     5.04
         (0.06)   (0.37)   (0.33)   (0.67)   (0.54)   (0.17)   (0.14)

1.00      0.37     5.21     5.94     16.98    32.37    8.86     5.36
         (0.05)   (0.34)   (0.58)   (0.65)   (0.74)   (0.16)   (0.20)

1.25      0.30     7.37     6.70     22.43    39.91    9.05     5.98
         (0.03)   (0.72)   (0.41)   (1.07)   (0.81)   (0.28)   (0.27)

1.50      0.36     7.23     8.12     25.37    46.71    10.34    6.70
         (0.04)   (0.81)   (0.84)   (0.92)   --       (0.30)   (0.16)

However, the accurate estimation of the ‘optimal’ value of $\beta$ is
difficult in many cases as the plot of the average misclassification rate
against $\beta$ (see figs 8 to 14) is J-shaped in the area of interest.


TABLE V

Average percentage of black pixels classified white using model III
Standard errors in brackets
Optimal reconstruction is bold faced

$\sigma^2 = 0.5$

$\beta$   BCIR     CROSS    TWO      MANY     VMANY    MRF2     MRF3

0.25      6.27     9.37     9.12     12.75    16.42    10.91    9.13
         (0.12)   (0.33)   (0.34)   (0.34)   (0.25)   (0.20)   (0.19)

0.50      1.16     4.52     4.81     11.30    16.88    8.08     5.33
         (0.06)   (0.37)   (0.19)   (0.37)   (0.36)   (0.21)   (0.16)

0.75      0.49     5.11     4.64     13.05    21.26    7.74     4.98
         (0.06)   (0.40)   (0.27)   (0.60)   (0.46)   (0.14)   (0.14)

1.00      0.40     4.82     4.48     14.55    26.60    8.47     5.25
         (0.06)   (0.41)   (0.39)   (0.56)   (0.70)   (0.16)   (0.20)

1.25      0.35     6.35     5.63     18.96    32.26    8.62     5.48
         (0.04)   (0.59)   (0.39)   (0.86)   (0.87)   (0.18)   (0.22)

1.50      0.37     6.05     6.05     21.36    39.30    9.67     6.19
         (0.04)   (0.63)   (0.58)   (0.87)   --       (0.23)   (0.17)

TABLE VI

Average percentage of white pixels classified black using model II
Standard errors in brackets
Optimal reconstruction is bold faced

$\sigma^2 = 0.5$

$\beta$   BCIR     CROSS    TWO      MANY     VMANY    MRF2     MRF3

0.25      4.61     4.45     4.60     5.08     7.58     10.22    7.79
         (0.11)   (0.12)   (0.10)   (0.13)   (0.16)   (0.24)   (0.19)

0.50      0.83     0.69     0.80     1.07     3.21     7.83     4.85
         (0.05)   (0.04)   (0.05)   (0.05)   (0.12)   (0.13)   (0.19)

0.75      0.64     0.55     0.53     0.77     2.22     8.79     5.23
         (0.04)   (0.06)   (0.06)   (0.05)   (0.11)   (0.16)   (0.15)

1.00      0.82     0.58     0.54     0.53     1.71     9.16     5.60
         (0.07)   (0.05)   (0.05)   (0.05)   (0.10)   (0.18)   (0.14)

1.25      1.08     0.60     0.71     0.53     1.85     10.63    6.35
         (0.10)   (0.07)   (0.10)   (0.05)   (0.18)   (0.28)   (0.25)

1.50      0.95     0.66     0.90     0.60     1.66     10.46    6.85
         (0.07)   (0.05)   (0.08)   (0.05)   (0.10)   (0.27)   (0.16)

TABLE VII

Average percentage of white pixels classified black using model III
Standard errors in brackets
Optimal reconstruction in bold face

$\sigma^2 = 0.5$

$\beta$   BCIR     CROSS    TWO      MANY     VMANY    MRF2     MRF3

0.25      6.25     6.87     --       --       9.17     11.33    9.28
         (0.14)   (0.13)   (0.12)   (0.14)   (0.15)   (0.25)   (0.18)

0.50      1.20     1.06     1.14     1.60     3.93     7.87     5.10
         (0.06)   (0.05)   (0.06)   (0.05)   (0.14)   (0.15)   (0.18)

0.75      0.68     0.59     0.59     0.91     2.60     8.28     4.97
         (0.05)   (0.06)   (0.05)   (0.05)   (0.08)   (0.20)   (0.14)

1.00      0.82     0.58     0.49     0.57     1.95     8.80     5.14
         (0.07)   (0.05)   (0.05)   (0.04)   (0.08)   (0.13)   (0.10)

1.25      0.99     0.55     0.64     0.57     2.02     9.89     5.96
         (0.08)   (0.06)   (0.08)   (0.07)   (0.17)   (0.22)   (0.20)

1.50      0.93     0.58     0.80     0.60     1.59     9.78     6.25
         (0.07)   (0.05)   (0.08)   (0.05)   (0.10)   (0.26)   (0.17)

The number of misclassified pixels is a crude image summary which takes
no account of the spatial characteristics of the scene. To gain further
insight into the differences between models II and III we use an image summary
which counts the number of misclassified pixels close to the true boundary
between black and white areas. A similar procedure was suggested by Owen, in
the discussion of Ripley (1986).

TABLE VIII

Average percentage of misclassified boundary pixels
for MRF3. Standard errors in brackets
The optimal reconstruction in bold face
(There are 2712 boundary pixels in MRF3)

                     $\beta$
Model   Pixels       0.25     0.5      0.75     1.0      1.25     1.5

II      Boundary     16.74    16.02    17.35    18.33    20.30    21.42
                    (0.23)   (0.17)   (0.21)   (0.22)   (0.25)    --

II      All          7.74     4.92     5.13     5.48     6.16     6.77
                    (0.12)   (0.09)   (0.07)   (0.08)   (0.12)    --

III     Boundary     17.27    15.98    16.78    17.51    19.11    20.19
                    (0.25)   (0.18)   (0.28)   (0.25)   (0.20)    --

III     All          9.20     5.22     4.98     5.20     5.72     6.22
                    (0.12)   (0.09)   (0.10)   (0.09)   (0.09)    --

We reconstruct MRF3 using models II and III with $\sigma^2 = 0.5$. The
average percentage of misclassified boundary pixels is displayed in Table
VIII. In this table we call a pixel with at least one neighbour of a different
colour (in the true scene) a boundary pixel. It is immediately apparent that
the majority of misclassified pixels lie near colour boundaries when moderate
values of $\beta$ are used. When MRF3 is reconstructed using model III and
$\beta = 0.5$ there are approximately 433 misclassified boundary pixels and 89
elsewhere. There is some evidence that the optimal reconstruction of boundary
pixels requires a smaller value of $\beta$ than the scene as a whole. This is
also apparent from the example described by Owen in the discussion of Ripley
(1986). There appears to be little observable difference between Models II and
III using this image summary.
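The boundary summary can be sketched as follows. The code assumes that "neighbour" means one of the eight surrounding pixels and that pixels outside the window are ignored; both are assumptions made for illustration, and the function names are hypothetical.

```python
NEIGHBOURS8 = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]


def is_boundary(truth, i, j):
    """True if (i, j) has at least one neighbour of a different true colour."""
    m, n = len(truth), len(truth[0])
    return any(0 <= i + di < m and 0 <= j + dj < n
               and truth[i + di][j + dj] != truth[i][j]
               for di, dj in NEIGHBOURS8)


def boundary_error_percentage(truth, estimate):
    """Percentage of true-scene boundary pixels that are misclassified."""
    n_boundary = n_errors = 0
    for i, row in enumerate(truth):
        for j, t in enumerate(row):
            if is_boundary(truth, i, j):
                n_boundary += 1
                if estimate[i][j] != t:
                    n_errors += 1
    return 100.0 * n_errors / n_boundary if n_boundary else 0.0
```

As a check on the figures quoted in the text, 15.98% of the 2712 boundary pixels in MRF3 is about 433, and 5.22% of all $10^4$ pixels is 522, which leaves 89 misclassified pixels away from the boundary.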

5.  THE  CHOICE  OF  THE  SMOOTHING  PARAMETER. 


In this section we attempt to identify features of the underlying scene and
error distribution which influence the choice of $\beta$ in (2.6). We restrict
attention to model II. First we examine the relationship between the ‘optimal’
value of $\beta$ and the signal variance $\sigma^2$. In figures 8 to 14 we
plot the average percentage of misclassified pixels against $\beta$ for
various values of $\sigma^2$. Notice that the value of $\beta$ which gives the
smallest average misclassification rate is approximately the same for all
values of $\sigma^2$ considered. The results for VMANY (fig 12) behave
atypically. In this respect the ICM algorithm differs from simple linear
regularisation techniques where the ‘optimal’ smoothing parameter is typically
proportional to the noise to signal ratio; see Hall & Titterington (1986,
p 336). The effect of grossly misspecifying $\sigma^2$ can be large, as the
example given in figure 7 of Ripley (1986) shows. However, the relative
stability of the misclassification rate to changes in $\beta$ close to its
‘optimal’ value suggests that ICM is robust to modest misspecification of
$\sigma^2$. We see from figs 8 to 14 that worthwhile gains can be achieved
using the ‘optimal’ value of $\beta$.

In the remainder of this section we examine the relationship between the
‘optimal’ value of $\beta$ and certain features of the underlying scene. First
we consider the relationship between the ‘optimal’ value of $\beta$ and its
maximum pseudo-likelihood estimate. In this approach we calculate the value of
$\beta$ which maximises the conditional likelihood

    $\prod_{i=1}^{m} \prod_{j=1}^{n} P(x_{ij} \mid F(i,j), \beta)$.    (5.1)
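A grid-search version of the maximum pseudo-likelihood estimate based on (5.1) might look like the sketch below. It assumes a binary scene coded 0/1, unit weights on the eight surrounding pixels, and that neighbours outside the window are ignored; the grid and the function names are hypothetical, and a numerical optimiser could equally be used.

```python
import math

NEIGHBOURS8 = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]


def log_pseudo_likelihood(x, beta):
    """Log of (5.1): sum over pixels of log P(x_ij | neighbours, beta)."""
    m, n = len(x), len(x[0])
    total = 0.0
    for i in range(m):
        for j in range(n):
            u = [0, 0]  # u[k] = number of neighbours with colour k
            for di, dj in NEIGHBOURS8:
                if 0 <= i + di < m and 0 <= j + dj < n:
                    u[x[i + di][j + dj]] += 1
            a0, a1 = beta * u[0], beta * u[1]
            top = max(a0, a1)  # subtract the maximum for numerical stability
            log_norm = top + math.log(math.exp(a0 - top) + math.exp(a1 - top))
            total += beta * u[x[i][j]] - log_norm
    return total


def pseudo_likelihood_estimate(x, grid):
    """Value of beta on the grid that maximises the log pseudo-likelihood."""
    return max(grid, key=lambda b: log_pseudo_likelihood(x, b))
```

For a one colour scene every term increases with $\beta$, so the search simply returns the largest value on the grid, which is the finite-grid counterpart of an unbounded estimate.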

From Table IX we see that the pseudo-likelihood estimates of $\beta$ using
model II ($\hat{\beta}_{II}$) are usually greater than the value of $\beta$
giving the smallest average misclassification rate. This behaviour may be due
to the fact that the majority of scenes considered are untypical realisations
from a MRF. For the scenes constructed by sampling from a conditional MRF a
different pattern emerges. In this case the ‘optimal’ $\beta$ is precisely the
value of $\beta$ used to construct the underlying scene (see Tables II, III
and IX), provided we use the correct model in our reconstruction. The
pseudo-likelihood approach has the disadvantage of indicating an infinite
value of $\beta$ for certain pixel configurations.

Figs 8-15
here

Next we introduce two statistics which measure the smoothness of the
underlying scene.

DEFINITION: TWO IMAGE SUMMARIES

$B$ :     Total boundary length between black and white pixels
          (excluding the window).

$Q_l$ :   The number of pixels which have at least one
          neighbour of a different colour using an $l$th order
          neighbourhood.

Notice that Switzer (1976) measures the ‘smoothness’ of a random function by
the total arc length of its contour plot at certain levels. Applying this
measure to binary random functions gives the statistic $B$. The image summary
$Q_l$ can be written as the difference between the statistics $e_l$ and $d_l$
defined in Ripley (1986, p 94) where pixels adjacent to the window are
neglected. See Ripley (1977) for a discussion of image summaries and their
application. Notice that $Q_2 = 2B$ for many scenes (see Table IX for several
examples). These statistics differ in their treatment of ‘small’ features. An
isolated black pixel will contribute 4 to the total boundary length and 9 to
$Q_2$.
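Both summaries are easy to compute directly. The sketch below assumes that boundary length is measured in units of one pixel side, so that each pair of horizontally or vertically adjacent pixels with different colours contributes one unit, and that the window edge contributes nothing. The function names are hypothetical.

```python
FIRST_ORDER = [(-1, 0), (1, 0), (0, -1), (0, 1)]
SECOND_ORDER = FIRST_ORDER + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def boundary_length(x):
    """B: number of unit pixel sides separating pixels of different colours."""
    m, n = len(x), len(x[0])
    total = 0
    for i in range(m):
        for j in range(n):
            if j + 1 < n and x[i][j] != x[i][j + 1]:
                total += 1
            if i + 1 < m and x[i][j] != x[i + 1][j]:
                total += 1
    return total


def q_statistic(x, order=2):
    """Q_l: number of pixels with at least one differently coloured neighbour."""
    offsets = FIRST_ORDER if order == 1 else SECOND_ORDER
    m, n = len(x), len(x[0])
    return sum(1 for i in range(m) for j in range(n)
               if any(0 <= i + di < m and 0 <= j + dj < n
                      and x[i + di][j + dj] != x[i][j]
                      for di, dj in offsets))


# An isolated black pixel in the middle of a white 5 x 5 window.
scene = [[0] * 5 for _ in range(5)]
scene[2][2] = 1
```

For this scene `boundary_length(scene)` gives 4 and `q_statistic(scene, order=2)` gives 9, since the black pixel and its eight white neighbours all count.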

There is strong evidence (see Table IX) to suggest that the misclassification
rate for a feature is strongly influenced by the percentage of boundary pixels
(as measured by $Q_2$ or boundary length, $B$). This effect is indicated by
the difference in the average percentage of misclassified black and white
pixels. The scene BCIR appears to behave in an anomalous way. There is some
evidence (see Table IX) that the value of $\beta$ giving the lowest average
proportion of misclassified pixels decreases as the proportion of boundary
pixels (as measured by $Q_2$ or total boundary length) increases. The value of
$\beta$ giving the smallest average percentage of misclassified pixels gives
the strongest evidence for this relationship. There appears to be little
difference in the descriptive ability of $Q_2$ and $B$. In the scenes
considered we see that the pseudo-likelihood estimates of $\beta$ are not
closely related to the smoothness measures described above.


TABLE IX

Smallest average percentage of misclassified pixels using model II
and the ‘optimal’ value of $\beta$ vs smoothness measures.
(* pseudo-likelihood estimate using model III)

$\sigma^2 = 0.5$

Picture             black    white    all      $Q_2$    $B$      $\hat{\beta}_{II}$

BCIR                0.30     0.64     0.55     344      172      1.85
         $\beta$    1.25     0.75     0.75
         pixels     4300     5700     10000

CROSS               4.30     0.55     1.00     516      260      2.09
         $\beta$    0.50     0.75     0.75
         pixels     926      9074     10000

TWO                 4.87     0.53     1.11     480      240      2.12
         $\beta$    0.5      0.75     0.75
         pixels     1225     8775     10000

MANY                11.91    0.53     2.41     1248     624      2.62
         $\beta$    0.25     1.25     0.5
         pixels     1216     8784     10000

VMANY               16.16    1.71     7.11     3776     1888     1.98
         $\beta$    0.25     1.0      0.5
         pixels     2560     7440     10000

MRF2                7.87     7.83     7.85     4109     2324     0.50
         $\beta$    0.5      0.5      0.5
         pixels     5065     4935     10000

MRF3                4.98     4.85     4.92     2712     1453     0.63
         $\beta$    0.5      0.5      0.5                        (* 0.75)
         pixels     5065     4935     10000

A useful indication of the effectiveness of a reconstruction technique can be
obtained by considering its properties in reconstructing a one colour scene. In
Table X we display the average percentage of misclassified pixels when a one
colour scene is reconstructed using model II. For values of β less than 0.4
appreciable errors are incurred. So for scenes with large monochrome areas we
should choose β ≥ 0.4.


TABLE X

Average percentage of misclassified pixels for a one colour scene
(using model II) for various values of σ²
Standard error in brackets

                β = 0.2    β = 0.25    β = 0.3    β = 0.35    β = 0.4

σ² = 0.25        4.98       3.35        2.15       1.30        0.80
                (0.03)     (0.03)      (0.02)     (0.02)      (0.01)

σ² = 0.50        6.6        3.93        2.26       1.31        0.75
                (0.06)     (0.05)      (0.03)     (0.03)      (0.02)

σ² = 0.75        7.14       4.06        2.34       1.40        0.82
                (0.06)     (0.06)      (0.05)     (0.04)      (0.03)

σ² = 1.0         7.24       4.25        2.61       1.59        1.06
                (0.08)     (0.07)      (0.06)     (0.05)      (0.05)

σ² = 1.25        7.49       4.46        2.68       1.87        1.31
                (0.08)     (0.09)      (0.07)     (0.07)      (0.06)

σ² = 1.50        7.68       4.52        3.03       2.11        1.52
                (0.10)     (0.09)      (0.08)     (0.08)      (0.05)

To illustrate this point further consider the percentage of misclassified pixels
for BCIR with σ² = 0.25. Recall that the majority of pixels in BCIR are far
from the colour boundaries. In Table XI we compare the percentage of
misclassified pixels using ICM with the percentage of misclassified pixels for a
one colour scene using the same model.


TABLE XI

A comparison of the average percentage of misclassified pixels of BCIR
and a monochrome scene when reconstructed using model II
Standard errors in brackets (60 realisations for mono scene)
Optimal reconstruction is bold faced

σ² = 0.25

β                 0.25       0.50       0.75       1.00       1.25       1.50

BCIR              4.53       0.80       0.55       0.65       0.75       0.70
                 (0.10)     (0.03)     (0.04)     (0.04)     (0.03)   [illegible]

Monochrome        3.34       0.27      <0.02      <0.02      <0.02    [illegible]
                 (0.02)     (0.01)  [illegible]  (<0.001) [illegible]  (<0.001)

The optimal reconstruction is obtained with β = 0.75, where the percentage of
misclassified pixels is 0.55. The contribution of pixels far from the colour
boundary is approximately 0.02%. These results suggest that the errors incurred
during the reconstruction of scenes like BCIR occur near the colour boundaries
for moderate values of β (see Table VIII).

Consider a black pixel which has k white neighbours when it is updated.
The probability of misclassifying this pixel during the current iteration can be
calculated from (2.6). In Table XII we display this probability for model II with
independent normally distributed noise (σ² = 0.5).


TABLE XII

The probability that a black pixel is classified white
at a particular iteration when it has k white neighbours

σ² = 0.5

  k      β = 0.25    β = 0.50    β = 1.0

  8        0.98        1.00        1.00
  7        0.92        1.00        1.00
  6        0.76        0.98        1.00
  5        0.50        0.76        0.98
  4        0.16        0.16        0.16
  3        0.08        0.02        0.00
  2        0.02        0.00        0.00
  1        0.00        0.00        0.00
  0        0.00        0.00        0.00

These calculations strongly suggest that model II behaves like a simple majority
vote when β ≥ 1.0. Table XII can be used to estimate the ‘vulnerability’ of
image features for various values of β. As an example consider the corner pixels
(k=5) of a black rectangle. This configuration is highly vulnerable when
β ≥ 0.5. As ICM is an iterative procedure this calculation will not give the
probability of misclassifying a given pixel. However, calculations of this type are
useful in visualising the effect of ICM with various values of β and neighbourhood
system. Using this approach to choose β is analogous to a method suggested
by Ripley (1986), with the important addition that information is
included about the noise distribution.

6.  SOME  DISTRIBUTIONAL  PROPERTIES  OF  ICM 

There appears to be no work in the literature on the distributional properties
of the ICM estimator of (x_ij) or any functional of interest. The only relevant
work is due to Geman and Geman (1984), who describe how to sample from
the posterior distribution of (x_ij). In this section we examine the variance of the
percentage of misclassified pixels. The number of misclassified pixels can be
regarded as a functional of the scene formed by a comparison between (x_ij) and
its reconstruction. In Table XIII we display the average percentage of
misclassified pixels with its standard deviation in brackets for σ² = 0.5 and
model II. The figures for the optimal reconstruction are given in bold face.
Recall that ICM is a ‘local’ procedure. This suggests a Poisson approximation
for the number of misclassified pixels. The coefficient of variation of the
percentage of misclassified pixels at the ‘optimal’ value of β appears to decrease
as the misclassification rate (and complexity) increases. This is not consistent
with a Poisson assumption. In particular we see from Table VIII that
misclassified pixels cluster near colour boundaries. The skewness (b_1) and
kurtosis (b_2) of the percentage of misclassified pixels were calculated and
suggest a symmetric distribution with b_2 between two and three. These are
tentative conclusions as the number of realisations used in this study is small.


TABLE XIII

The standard deviation (in brackets) and the average percentage
of misclassified pixels using model II
The optimal reconstruction is bold faced

σ² = 0.5

  β        BCIR      CROSS     TWO       MANY      VMANY     MRF2      MRF3

 0.25      4.53      4.75      4.96      5.91      9.78      9.86      7.74
          (0.40)    (0.42)    (0.33)    (0.49)    (0.34)    (0.54)    (0.48)

 0.50      0.80      1.02      1.30      2.41      7.11      7.85      4.92
          (0.10)    (0.16)    (0.17)    (0.29)    (0.?9)    (0.35)    (0.33)

 0.75      0.55      1.00      1.11      2.48      8.04      8.44      5.13
          (0.14)    (0.26)    (0.18)    (0.34)    (0.73)    (0.39)    (0.25)

 1.00      0.63      1.01      1.20      2.53      9.56      9.01      5.48
          (0.16)    (0.20)    (0.25)    (0.35)    (0.71)    (0.35)    (0.32)

 1.25      0.75      1.22      1.44      3.19     11.60      9.83      6.16
          (0.21)    (0.33)    (0.38)    (0.41)    (0.96)    (0.45)    (0.47)

 1.50   [illegible] [illegible] 1.78     3.61     13.19     10.40      6.77
                              (0.46)    (0.46)    (1.10)    (0.51)    (0.45)

7.  COMPUTATIONAL  DETAILS 

Pseudo-random deviates distributed uniformly on [0,1] were generated using the
algorithm of Wichmann & Hill (1982). We take the seeds ix=27631, iy=5627 and
iz=10234. Pseudo-normal deviates with zero mean and unit variance were constructed
using the Box-Muller transformation. The first step in our algorithm is to
determine the maximum likelihood estimate of (x_ij). This colouring is used as the
initial state (iteration zero) of our algorithm. Each pixel is visited in raster
scan order and the colour of the (i,j)th pixel is updated using (2.6). The cpu
time taken by our algorithm is proportional to the size of the neighbourhood
system used, the number of pixels and the size of σ².
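
The noise generation described above can be sketched as follows. This is a minimal illustration, not the authors' program: it assumes the standard published form of the Wichmann & Hill (1982) generator (Algorithm AS 183, with multipliers 171, 172, 170 and moduli 30269, 30307, 30323) and uses only the cosine branch of the Box-Muller transformation. The class and function names are hypothetical.

```python
import math


class WichmannHill:
    """Uniform generator in the standard AS 183 form (assumed here)."""

    def __init__(self, ix, iy, iz):
        self.ix, self.iy, self.iz = ix, iy, iz

    def uniform(self):
        # Three small multiplicative congruential generators are advanced
        # and their scaled outputs are added modulo 1.
        self.ix = (171 * self.ix) % 30269
        self.iy = (172 * self.iy) % 30307
        self.iz = (170 * self.iz) % 30323
        return (self.ix / 30269.0 + self.iy / 30307.0 + self.iz / 30323.0) % 1.0


def box_muller(rng):
    """Return one N(0, 1) deviate built from two uniform deviates."""
    u1 = rng.uniform()
    while u1 == 0.0:          # guard against log(0)
        u1 = rng.uniform()
    u2 = rng.uniform()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)


# Seeds quoted in the text.
rng = WichmannHill(27631, 5627, 10234)
first_u = rng.uniform()
state_after_one = (rng.ix, rng.iy, rng.iz)

# Noise with variance sigma2 would be obtained as sqrt(sigma2) * box_muller(rng).
normals = [box_muller(rng) for _ in range(2000)]
sample_mean = sum(normals) / len(normals)
sample_var = sum((z - sample_mean) ** 2 for z in normals) / (len(normals) - 1)
```

The sample mean and variance of `normals` should be close to 0 and 1 respectively, which gives a quick sanity check on the generator.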

In Table XIV we display the average number of pixels whose colour
changes during the kth iteration when MRF3 is reconstructed using model II.
The average percentage of misclassified pixels is also presented. In this table
one iteration is equivalent to a complete sweep of the scene (10^4 pixel visits).

Notice that the majority of changes occur during the first iteration (more
changes are made as β increases). Typically only one or two pixels change
colour during later iterations. This pattern is repeated for each combination of
scene, σ² and model considered.


TABLE XIV

Average number of changes per iteration and percentage of
misclassified pixels for MRF3 (model II)

Standard errors in brackets

σ² = 0.5

            β = 0.25               β = 0.50               β = 1.0
  k     changes  % miscl’d     changes  % miscl’d     changes  % miscl’d

  1      1587      9.84         2117      6.47         2346      6.58
          (8)     (0.13)        (12)     (0.09)        (10)     (0.08)

  2       206      8.18          189      5.31          153      5.93
          (5)     (0.12)         (4)     (0.08)         (3)     (0.08)

  3        42      7.87           44      5.07           50      5.70
          (2)     (0.12)         (3)     (0.08)         (2)     (0.08)

  4        12      7.78           16      4.98           21      5.58
          (1)     (0.12)         (1)     (0.08)         (1)     (0.08)

  5         3      7.75            6      4.95           10      5.52
          (1)     (0.12)         (1)     (0.08)         (1)     (0.08)

  6         1      7.74            3      4.93            5      5.50
          (0)     (0.12)        (0.6)    (0.08)         (1)     (0.08)

 12         0      7.74            0      4.92            0      5.48
                  (0.12)                 (0.09)                 (0.08)

This suggests the following modification of the basic algorithm:


Pixels are only updated when they are flagged as ‘active’. The pixel (i,j) is
‘active’ when the colour of at least one of its neighbours has changed during the
current iteration. Pixels are visited in raster order. When a pixel’s colour
changes its neighbours become active. Pixels are de-activated after they are
updated.

Using this algorithm we would visit (see Table XIV) fewer than nine hundred
pixels on average (using a second order neighbourhood) during the third
iteration. We expect the modified algorithm to converge after approximately 3
iterations in general. To obtain further gains in efficiency we might ‘switch
off’ pixels whose colour has a low probability of being changed during the
current iteration; see Ripley (1986). For example a pixel which has no
neighbours of a different colour can be de-activated.
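
The active-pixel modification can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: the update rule (2.6) is not reproduced in this section, so the code assumes a binary scene with colours 0 and 1, independent Gaussian noise with variance `sigma2`, a second order (eight-neighbour) system, and a prior that rewards each neighbour of the same colour by `beta`. The exact scaling of (2.6) may differ. The function name `icm_active` and its arguments are hypothetical.

```python
def icm_active(y, beta, sigma2, max_sweeps=50):
    """ICM with 'active' flags; returns (restored scene, visits per sweep)."""
    rows, cols = len(y), len(y[0])
    # Iteration zero: maximum likelihood estimate (nearest of the two colours).
    x = [[1 if v > 0.5 else 0 for v in row] for row in y]
    # Every pixel is active for the first sweep.
    active = [[True] * cols for _ in range(rows)]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    visits = []
    for _ in range(max_sweeps):
        n_visits = 0
        for i in range(rows):              # raster scan order
            for j in range(cols):
                if not active[i][j]:
                    continue
                active[i][j] = False       # de-activate once updated
                n_visits += 1
                nbrs = [(i + di, j + dj) for di, dj in offsets
                        if 0 <= i + di < rows and 0 <= j + dj < cols]
                ones = sum(x[a][b] for a, b in nbrs)
                zeros = len(nbrs) - ones
                # Assumed local criterion: Gaussian log-likelihood plus
                # beta times the number of like-coloured neighbours.
                score1 = -(y[i][j] - 1.0) ** 2 / (2.0 * sigma2) + beta * ones
                score0 = -(y[i][j] - 0.0) ** 2 / (2.0 * sigma2) + beta * zeros
                new = 1 if score1 > score0 else 0
                if new != x[i][j]:
                    x[i][j] = new
                    for a, b in nbrs:      # a change re-activates neighbours
                        active[a][b] = True
        visits.append(n_visits)
        if n_visits == 0:                  # nothing active: converged
            break
    return x, visits


# A 5 x 5 colour-0 scene in which noise has pushed the centre record to 0.9.
scene = [[0.0] * 5 for _ in range(5)]
scene[2][2] = 0.9
restored, visits = icm_active(scene, beta=0.5, sigma2=0.5)
```

In this small example the first sweep visits all 25 pixels, the second visits only the four neighbours of the corrected pixel that had already been passed in raster order, and the third sweep finds nothing active.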

8.  CONCLUSIONS 


From  the  simulation  study  described  in  this  paper  we  suggest  the  following 
rules  of  thumb  for  prospective  users  of  ICM. 

1. Should I use ICM?

Our empirical results suggest that the misclassification rate of a feature
increases with the proportion of boundary pixels (see Table IX and compare the
misclassification rate for black and white pixels). Typically small features will
be ‘erased’. If the aim of an analysis is to find small features then a technique
based on masks will probably be preferable to ICM. However, it is apparent
from Table I that substantial gains over the maximum likelihood estimate can
be achieved by smoothing.

2. Which model should I use?

We suggest that model III should be used in the absence of specific
knowledge about the uncorrupted scene. If we know that the underlying scene
is non-homogeneous we can exploit this by using a hierarchical model; see
Derin & Elliott (1987) or Woods, Dravida & Mediavilla (1987).

3. What value of β should I use?

This is a difficult question to answer in the absence of any information
about the underlying scene. The examples considered in this paper suggest that
useful gains can be achieved using the ‘optimal’ value of β rather than a
portmanteau value of, say, β=1.5. We distinguish between two cases. In the first
we assume that the underlying scene is a ‘typical’ realisation from a MRF.
Then the ‘optimal’ reconstruction is obtained using the neighbourhood system
and value of β specified by the underlying MRF. When the underlying scene
cannot be regarded as a ‘typical’ realisation from a MRF we suggest the use
of smoothness measures such as the total boundary length in the choice of the
‘optimal’ value of β. In both cases we see that the ‘optimal’ value of β does
not depend on σ². From Figures 8 to 14 we see that there is some leeway in
choosing the ‘optimal’ value of β.

4. Is the ICM estimate difficult to calculate?

From the discussions in Section 7 we see that a single reconstruction of a
binary 10^4 pixel scene can be computed simply. The calculations appear well
suited to parallel implementation. The scene VMANY with σ² = 0.5 was
reconstructed in around 39 seconds (using model II with β=0.5) on a SUN-3
workstation with a floating point accelerator.


CAPTIONS FOR FIGURES 1 TO 15

FIGURE 1 BCIR : Circle centred at (30,30) with radius 40. The origin is at
the bottom left hand corner of the window which has dimensions
(0,100)x(0,100).


FIGURE 2 CROSS : Two rectangles with corners at
{(10,40),(60,20),(70,30),(20,50)} and {(25,20),(30,15),(55,50),(50,55)}


FIGURE 3 TWO : Two rectangles with corners at {(10,40), (60,40),
(60,50), (10,50)} and {(20,55), (65,55), (65,60), (20,60)}


FIGURE 4 MANY : Eight circles of radius 6 centred at (25,20), (45,20),
(65,20), (80,20), (25,80), (45,80), (65,80), (85,80) and ten circles of radius
3 centred at (20,40), (35,40), (50,40), (65,40), (80,40), (20,60), (35,60),
(50,60), (65,60), (80,60).


FIGURE 5 VMANY : Eighty circles with radius 3 and centres at
(5+10y, 10x-7) for y=1,...,8 and x=1,...,10.


FIGURE 6 MRF2 : A synthetic realisation from the MRF specified in
MODEL II with β=0.5. This scene was constructed using an algorithm given in
Cross and Jain (1983).


FIGURE 7 MRF3 : A synthetic realisation from the MRF specified in MODEL
III with β=0.75. This scene was constructed using the algorithm given in Cross
and Jain (1983).


FIGURE 8 A plot of the average percentage of misclassified pixels against β
and σ² when BCIR is reconstructed using MODEL II


FIGURE 9 A plot of the average percentage of misclassified pixels against β
and σ² when CROSS is reconstructed using MODEL II


FIGURE 10 A plot of the average percentage of misclassified pixels against β
and σ² when TWO is reconstructed using MODEL II


FIGURE 11 A plot of the average percentage of misclassified pixels against β
and σ² when MANY is reconstructed using MODEL II


FIGURE 12 A plot of the average percentage of misclassified pixels against β
and σ² when VMANY is reconstructed using MODEL II


FIGURE 13 A plot of the average percentage of misclassified pixels against β
and σ² when MRF2 is reconstructed using MODEL II


FIGURE 14 A plot of the average percentage of misclassified pixels against β
and σ² when MRF3 is reconstructed using MODEL II


FIGURE 15 A plot of the average percentage of misclassified pixels against β
for model II and (1/1.117)β for model III


Speed of estimation in positron emission tomography.

Several algorithms for image reconstruction in positron emission tomography
(PET) have been described in the medical and statistical literature. We
study a continuous idealisation of the PET reconstruction problem,
considered as an example of bivariate density estimation based on indirect
observations. Given a large sample of indirect observations, we consider the
size of the equivalent sample of observations, whose original exact positions
would allow equally accurate estimation of the image of interest. Both for
indirect and for direct observations, we establish exact minimax rates of
convergence of estimation, for all possible estimators, over suitable
smoothness classes of functions. For indirect data and (in practice
unobservable) direct data, the rates for mean integrated square error are
$n^{-p/(p+2)}$ and $(n/\log n)^{-p/(p+1)}$ respectively, for densities in a class
corresponding to bounded square-integrable pth derivatives. We obtain
numerical values for equivalent sample sizes for minimax linear estimators
using a slightly modified error criterion. Modifications of the model to
incorporate attenuation and the third dimension effect do not affect the
minimax rates.

1. Introduction

Tomography is a non-invasive technique for reconstructing the internal structure
of an object of interest, often in a medical context. Positron emission tomography
(PET) deals with the estimation of the amount and location of a radioactively labeled
metabolite on the basis of particle decays indirectly observed outside the body.
Emission tomography in general, and PET in particular, has been the subject of
considerable recent research in nuclear medicine, and has attracted the interest of
statisticians as an example of a reconstruction problem involving incomplete and noisy
data.

The  formulation  of  the  PET  problem  we  shall  consider  is  basically  that  given  by 
Shepp  and  Vardi  (1982)  and  Vardi,  Shepp  and  Kaufman  (1985).  Following  their 
convention  we  shall  consider  a  particular  PET  experiment,  where  the  brain  is  scanned 
by  counting  radioactive  emissions  from  tagged  glucose.  The  distribution  of  glucose 
within  the  brain  corresponds  to  the  glucose  uptake  mechanism,  and  so  a  map  of  the 
glucose  distribution  within  the  brain  gives  an  indication  of  the  pattern  of  the  brain’s 
metabolic  activity.  In  the  idealisation  we  shall  consider,  following  Vardi  et  al.  (1985), 
the  radioactive  tagging  of  the  glucose  gives  rise  to  emissions  of  positrons  distributed  as 
a  Poisson  process  in  space  and  time;  the  spatial  intensity  of  emissions  is  the  same  as 
the  distribution  of  glucose.  Each  positron  that  is  emitted  annihilates  with  a  nearby 
electron,  and  yields  two  photons  that  fly  off  in  opposite  directions  along  a  line  with 
uniformly  distributed  orientation.  One  or  more  rings  of  sensors  placed  around  the 
patient’s  head  make  it  possible  to  detect  the  photon  pairs  and  hence,  for  each  emission 
that  is  detected,  to  give  a  line  on  which  the  point  of  emission  must  have  occurred. 
However,  for  equipment  of  the  kind  discussed  here,  it  is  not  possible  to  detect  the 
position  of  the  emission  on  the  line. 

The  PET  problem  is  just  one  of  a  large  number  of  statistical  problems  involving 
indirect  observations  of  the  phenomenon  of  interest;  in  our  case  the  observations  are 
indirect  in  that  the  emissions  themselves  are  not  observed  directly.  Such  problems 
arise, for example, in geophysics, in stereology and wherever linear deconvolution with
known  filter  is  required.  Our  aim  in  the  present  paper  is  not  just  to  study  the  PET 
problem  but  also  to  develop  theory  that  can  be  applied  in  many  other  contexts. 

In  a  typical  PET  scan,  a  large  number,  perhaps  one  to  ten  million,  radioactive 
emissions  are  recorded,  and  the  image  of  interest,  a  slice  through  the  patient’s  brain  or 
body,  is  reconstructed  in  some  way  from  this  apparently  vast  data  set.  But  is  ten 
million  observations  really  a  large  sample  in  this  kind  of  context?  One  way  of  gaining 
some  insight  into  the  problem  is  to  think  in  terms  of  equivalent  sample  sizes.  We 
make  some  smoothness  assumptions  about  the  image  of  interest,  and  then  ask  how 
accurately it could possibly be reconstructed given a particular indirect sample. The
equivalent sample size would be the number of emissions whose original positions
could yield an equally accurate estimate. The equivalent sample size gives, in terms
more attuned to usual statistical intuition, a quantification of the information actually
available from our sample of ten million indirectly observed emissions, and hence
gives an idea of how much is lost by the indirect nature of the observation process.

In  Section  2  below,  we  formulate  the  reconstruction  problem  as  an  example  of 
nonparametric  bivariate  density  estimation  based  on  indirect  data,  in  fact  an  example 
of  a  linear  inverse  problem  in  a  function  space.  The  function  we  estimate  is  the 
intensity  function  of  emissions  in  the  slice  through  the  brain.  A  key  feature  of  our 
treatment  is  the  explicit  singular  value  decomposition  of  the  transform  linking  the 
unknown  density  with  that  of  the  observed  data.  The  main  conclusions  of  the  paper 
are  summarised  in  Section  3.  In  particular  we  give  in  Section  3  a  table  of  explicit 
equivalent  sample  sizes,  admittedly  for  our  mathematical  idealisation  of  the  PET 
problem.  In  Section  4  we  confine  attention  to  linear  estimators,  and  to  intensities 
falling  in  a  suitable  smoothness  class  of  functions.  We  find  the  exact  minimax  rates  of 
consistency,  that  is  the  rate  for  the  least  favourable  density  and  the  best  linear 
estimator.  We  then  show,  in  Section  5,  that  these  rates  cannot  be  improved  by 
extending  consideration  to  all  possible  estimators,  linear  or  non-linear.  Thus  we  do  not 
consider  particular  iterative  non-linear  algorithms  proposed  elsewhere  for  practical  use, 
but  instead  we  establish  the  best  possible  performance  achievable  by  any  estimator. 

Section  6  of  the  paper  considers  modifications  of  our  mathematical  idealisation  in 
order  to  take  account  of  attenuation  of  the  emitted  photons  and  of  the  three 
dimensional  nature  of  the  problem.  Our  broad  conclusions  carry  over  when  these 
effects  are  incorporated.  In  Section  7,  we  extend  our  results  to  some  error  measures 
based  on  the  derivatives  as  well  as  the  values  of  the  images  and  their  reconstructions. 
Finally,  in  Section  8,  we  make  some  concluding  remarks,  and  mention  some  possible 
issues  for  future  research. 

A  subsidiary  objective  of  the  paper  is  to  illustrate,  in  a  relatively  simple  and 
concrete  setting,  the  general  approach  to  deriving  lower  bounds  to  estimation  risk 
developed  by  Le  Cam  (1985,  for  example),  Ibragimov  and  Hasminskii  (1981),  and 
Birge  (1983).  This  method  relates  the  best  possible  speed  of  estimation  (in  a  given 
"global"  metric)  to  the  metric  entropy  structure  of  the  parameter  space.  We  need  a 
minor modification to handle the present indirect estimation setting, introducing a form
of  "modulus  of  continuity"  of  the  inverse  transform.  This  material  is  presented  mainly 
in  Section  5. 

There  is  a  substantial  literature  on  practical  algorithms  for  reconstruction  in  the 
PET  setting.  An  extensive  survey  covering  the  period  up  to  1979  is  given  by  Budinger, 
Gullberg and Huesman (1979); this includes adaptation of methods from X-ray
transmission tomography and the orthogonal series method of Marr (1974). Maximum
likelihood  methods  were  proposed  by  Rockmore  and  Macovski  (1977);  they  were 
implemented  via  the  EM  algorithm  by  Shepp  and  Vardi  (1982)  (see  also  Vardi,  Shepp 
and  Kaufman,  1985)  and  modified  in  various  ways  to  incorporate  smoothing  by 
Geman  and  McClure  (1985)  and  Silverman  et  al.  (1988).  Some  practical  illustration  of 
the  orthogonal  series  method  introduced  in  the  present  paper  is  given  by  Jones  and 
Silverman  (1989).  A  recent  survey  of  algorithms  is  given  by  Tanaka  (1987).  Papers 
considering  noise  limitations  in  X-ray  and  transmission  tomography  include  Chesler  et 
al.  (1977)  and  Tretiak  (1978,  1979).  The  focus  of  these  papers  differs  from  ours  in 
that  they  consider  estimation  of  a  fixed  finite  number  of  real-valued  functions  of  a 
particular  unknown  intensity,  using  discrepancies  based  on  variance  rather  than  mean 
square  error. 


2.  Mathematical  model  and  technical  preliminaries 


2.1  An  idealised  problem  and  the  Radon  transform 


In  our  idealised  version  of  the  PET  problem,  the  ring  of  detectors  defines  a  slice 
of  the  patient’s  head,  and  the  reconstruction  aims  to  display  a  picture  of  the  glucose 
density  within  that  slice.  Emissions  that  give  rise  to  photon  pairs,  one  or  both  of 
which  miss  the  detector  ring,  will  go  unrecorded.  Bearing  this  in  mind,  we  shall 
regard  the  slice  as  a  plane  and  consider  an  essentially  two-dimensional  problem  where 
(see  Fig.  2.1)  emissions  take  place  in  the  plane  according  to  some  density  within  a 
detector  circle  taken  to  be  the  unit  circle  in  the  plane.  An  emission  at  P  gives  rise  to  a 
photon  pair  whose  directions  of  flight  lie  in  the  plane  along  a  line  l  through  P  with 
random,  uniformly  distributed,  orientation.  The  finite  size  of  the  detectors  is  ignored 
and it is assumed that the points Q and R of the intersection of l with the detector
circle  are  observed  exactly. 


Give the name detector space to the space D of all possible unordered pairs QR
of points on the detector circle, and call brain space the original disc B in the plane
enclosed by the detector ring. Assume that coordinates are chosen so that B is the unit
disc. Brain space is parametrised either by Cartesian or standard polar coordinates. To
parametrise detector space, let $s$ be the length of the perpendicular from the origin to
the detected line QR as in Figure 2.2, and $\varphi$ the orientation of this perpendicular. Thus
D is $\{(s,\varphi) : 0 \le s \le 1,\ 0 \le \varphi < 2\pi\}$.

We now define dominating measures on brain space and on detector space.
Define a measure $\mu$ on brain space to be $\pi^{-1} \times$ Lebesgue measure, so that
$d\mu(r,\theta) = \pi^{-1} r\,dr\,d\theta$ for $0 \le r \le 1$ and $0 \le \theta < 2\pi$ if polar coordinates are used, and
$d\mu(x_1,x_2) = \pi^{-1}\,dx_1\,dx_2$ for $\|x\| \le 1$ in Cartesian coordinates. On detector space, define
a measure $\lambda$ by $d\lambda(s,\varphi) = 2\pi^{-2}(1-s^2)^{1/2}\,ds\,d\varphi$. Both $\mu$ and $\lambda$ integrate to 1.

Suppose an emission takes place at a point distributed with probability density
$f(x_1,x_2)$ with respect to $\mu$ in brain space. Let $g = Pf$ be the probability density in
detector space, with respect to $\lambda$, of the corresponding detection of a pair of photons,
so that the mapping $P$ maps the actual density of emissions to the corresponding
observable density in detector space. We shall show below that $Pf$ is given by

$$Pf(s,\varphi) = \tfrac{1}{2}(1-s^2)^{-1/2} \int_{-\sqrt{1-s^2}}^{\sqrt{1-s^2}} f(s\cos\varphi - t\sin\varphi,\ s\sin\varphi + t\cos\varphi)\,dt. \qquad (2.1)$$

The integral in (2.1) is the so-called Radon transform (see Marr, 1974; Deans, 1973)
of the density $f$, namely the line integral of $f$ along the line $l$ with co-ordinates $(s,\varphi)$
in detector space. Since the length of the segment QR is $2(1-s^2)^{1/2}$, it can be seen at
once that $Pf(s,\varphi)$ is the average of $f$ over the part of $l$ that intersects the detector disc
$\|x\| \le 1$. If $f$ is the uniform density in brain space, so that $f(x_1,x_2) = 1$ for all $\|x\| \le 1$,
then we will have $Pf(s,\varphi) = 1$ for all $s$ and $\varphi$. Thus the probability measure $\lambda$ in
detector space is the detector space distribution corresponding to the uniform measure
$\mu$ in brain space.
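
As an illustration of (2.1), the normalised transform can be evaluated numerically by a midpoint rule along the chord. The sketch below is only a check on the formula, not part of the paper's method; the function name `normalised_radon`, the step count and the test density `dome` are all hypothetical choices. The density $f(x) = 2(1-\|x\|^2)$ integrates to 1 with respect to the measure on brain space defined above, and direct integration gives $Pf(s,\varphi) = \tfrac{4}{3}(1-s^2)$ for it.

```python
import math


def normalised_radon(f, s, phi, n_steps=2000):
    """Midpoint-rule approximation to Pf(s, phi) in (2.1)."""
    half = math.sqrt(max(0.0, 1.0 - s * s))    # half-length of the chord QR
    c, sn = math.cos(phi), math.sin(phi)
    if half == 0.0:
        # Degenerate chord (s = 1): the average reduces to a point value.
        return f(s * c, s * sn)
    h = 2.0 * half / n_steps
    total = 0.0
    for k in range(n_steps):
        t = -half + (k + 0.5) * h              # midpoint of the k-th subinterval
        total += f(s * c - t * sn, s * sn + t * c)
    # (1/2)(1 - s^2)^(-1/2) times the line integral = average of f over the chord.
    return 0.5 * total * h / half


def uniform(x1, x2):
    return 1.0


def dome(x1, x2):
    return 2.0 * (1.0 - x1 * x1 - x2 * x2)


pf_uniform = normalised_radon(uniform, 0.3, 1.0)
pf_dome_half = normalised_radon(dome, 0.5, 0.7)
pf_dome_zero = normalised_radon(dome, 0.0, 2.0)
```

For the uniform density the result is 1 for every chord, in agreement with the remark that the uniform measure on brain space maps to the dominating measure on detector space.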

It remains to verify (2.1). Suppose an emission takes place at $(X_1,X_2)$ and that
the corresponding photon pair has trajectory at angle $\Psi$ as shown in Figure 2.3; taking
$0 \le \Psi < \pi$ for definiteness, the joint probability density with respect to $dx_1\,dx_2\,d\psi$ on
$\|x\| \le 1$ and $0 \le \psi < \pi$ is given by

$$f_{X_1,X_2,\Psi}(x_1,x_2,\psi) = \pi^{-2} f(x_1,x_2),$$

using the definition of $\mu$ and the fact that $\Psi$ is independent of $X_1$ and $X_2$. Now
change variables by setting

$$S = |X_1\cos\Psi + X_2\sin\Psi|,$$

$$\Phi = \begin{cases} \Psi & \text{if } X_1\cos\Psi + X_2\sin\Psi > 0, \\ \Psi + \pi & \text{otherwise,} \end{cases}$$

$$T = -X_1\sin\Psi + X_2\cos\Psi;$$

the variables $(S,\Phi)$ are the coordinates of the detected photon pair. After making the
transformation, which has unit Jacobian, and integrating out the unobserved variable $T$,
we obtain the joint density with respect to $ds\,d\varphi$

$$f_{S,\Phi}(s,\varphi) = \pi^{-2} \int_{-\sqrt{1-s^2}}^{\sqrt{1-s^2}} f(s\cos\varphi - t\sin\varphi,\ s\sin\varphi + t\cos\varphi)\,dt.$$

The density (2.1) with respect to $\lambda$ follows at once from the definition of $\lambda$.
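
The change of variables can be checked by simulation. The following sketch is an illustration only, under the simplest assumption of a uniform emission density on the unit disc: it draws an emission point and an independent uniform angle, then forms the detector coordinates exactly as in the definitions of S and Φ above. For uniform emissions the detected s should have marginal density proportional to (1 - s^2)^(1/2) on [0,1], which implies E[S] = 4/(3π) and E[S^2] = 1/4. The function name and the seed are hypothetical.

```python
import math
import random


def simulate_detection(rng):
    """Draw one detected line (s, phi) for a uniform emission density."""
    # Emission point uniform on the unit disc, by rejection from the square.
    while True:
        x1 = 2.0 * rng.random() - 1.0
        x2 = 2.0 * rng.random() - 1.0
        if x1 * x1 + x2 * x2 <= 1.0:
            break
    psi = math.pi * rng.random()       # angle uniform on [0, pi)
    proj = x1 * math.cos(psi) + x2 * math.sin(psi)
    s = abs(proj)
    phi = psi if proj > 0 else psi + math.pi
    return s, phi


rng = random.Random(12345)
sample = [simulate_detection(rng) for _ in range(20000)]
mean_s = sum(s for s, _ in sample) / len(sample)
mean_s2 = sum(s * s for s, _ in sample) / len(sample)
```

Comparing `mean_s` with 4/(3π) ≈ 0.424 and `mean_s2` with 0.25 gives a quick empirical confirmation that uniform emissions produce detections distributed according to the dominating measure on detector space.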

2.2  Estimators  and  loss  functions 

In this section, we define various classes of estimator of $f$ that we shall be
considering, as well as two measures of the accuracy of estimation of $f$. The proofs of
the three propositions stated in this section are given in the Appendix.

Two particular classes of estimator are of obvious interest. Let $\mathcal{T}_D(n)$ be the
class of all possible estimators based on a sample of $n$ independent direct observations
in brain space from the density $f$. Let $\mathcal{T}_I(n)$ be the class of all estimators of $f$ based
on a sample of $n$ indirect observations, i.e. observations in detector space drawn from
the density $Pf$. It will also be important in some of our work to concentrate attention
on linear estimators. An estimator $\hat f$ based on observations $Z_i$ is called linear if
there exists a weight function $w(x,z)$ such that
$\int w(x,z)\,d\mu(x) = 1$ for all $z$ in the space of the observations, and

$$\hat f(x) = n^{-1}\sum_{i=1}^{n} w(x,Z_i) \quad \text{for all } x \text{ in } B. \tag{2.2}$$

Let $\mathcal{T}_{LD}(n)$ be the set of all linear estimators based on a direct sample of size $n$
subject to the additional condition $\iint w(x,x')^2\,d\mu(x)\,d\mu(x') < \infty$, and let $\mathcal{T}_{LI}(n)$ be the
set of all linear estimators of $f$ based on an indirect sample of size $n$ for which
$\iint w(x,y)^2\,d\mu(x)\,d\lambda(y) < \infty$. The additional square integrability conditions are mild; they
ensure that $\hat f$ has finite mean integrated square error if $f$ is bounded.

One natural measure of the accuracy of an estimator $\hat f$ is the mean integrated
square error $M(\hat f;f) = E_f\int_B (\hat f - f)^2\,d\mu$. By standard calculations,

$$M(\hat f;f) = \int \big[\operatorname{var}_f \hat f(x) + \{E_f \hat f(x) - f(x)\}^2\big]\,d\mu(x), \tag{2.3}$$

where the suffix $f$ indicates that the mean and variance are calculated for data drawn
from $f$ in the direct case and $Pf$ in the indirect case. We define the surrogate mean
integrated square error $M^*(\hat f;f)$ by replacing the variance term in (2.3) by the
corresponding term calculated for the uniform density on brain space:

$$M^*(\hat f;f) = \int \big[\operatorname{var}_1 \hat f(x) + \{E_f \hat f(x) - f(x)\}^2\big]\,d\mu(x), \tag{2.4}$$

where $\operatorname{var}_1$ denotes a variance calculated with respect to data drawn from the
probability measure $\mu$ in the direct case and $\lambda$ in the indirect case. An important
relation between the surrogate and the true mean integrated square error for linear
estimators is given by the following proposition.

Proposition 2.1 Suppose that $f$ is bounded above and below away from zero. Then,
for all $\hat f$ in $\mathcal{T}_{LD}(n)$ or in $\mathcal{T}_{LI}(n)$,

$$\inf_B f(x) \le M(\hat f;f)/M^*(\hat f;f) \le \sup_B f(x).$$


2.3  The  singular  value  decomposition  of  the  Radon  transform 

The singular value decomposition (SVD) of the normalised Radon transform $P$
defined in (2.1) is the key to our study of the loss of information about $f$ due to
indirect observation. To establish notation, let $H$ and $K$ be Hilbert spaces and
$P : H \to K$ a bounded linear operator. Under suitable conditions, there exist
orthonormal sets of functions $\{\varphi_\nu\}$ in $H$ and $\{\psi_\nu\}$ in $K$, and positive real numbers
$\{b_\nu\}$, the singular values of $P$, such that the $\{\varphi_\nu\}$ span the orthogonal complement of
the kernel of $P$, the $\{\psi_\nu\}$ span the range of $P$, and $P\varphi_\nu = b_\nu\psi_\nu$ for all $\nu$.

Thus $P$ is diagonal in the bases $\{\varphi_\nu\}$ and $\{\psi_\nu\}$. If a singular value $b_\nu$ is small,
then noise encountered in estimation of the component of $f$ along $\varphi_\nu$ will be amplified
by a factor of $b_\nu^{-1}$. Some form of regularization method (Tikhonov and Arsenin,
1977) is needed to deal with this instability, and one such method, based on tapered
orthogonal series, will be exploited in Section 4 below.

In our PET model, $H$ is the space $L^2(B,\mu)$ of functions on brain space which are
square-integrable with respect to the dominating measure $\mu$. Correspondingly, $K$ is the
space $L^2(D,\lambda)$ of detector-space functions square-integrable relative to $\lambda$. Suppose
that $X = (X_1,X_2)$ is drawn at random (according to $\mu$) from brain space $B$. If a
direction $\varphi$ is specified by $u_\varphi = (\cos\varphi, \sin\varphi)$, then

$$Pf(s,\varphi) = E\{f(X) \mid u_\varphi \cdot X = s\}.$$

From this representation it follows at once that $P$ is a bounded operator from $L^2(B,\mu)$
to $L^2(D,\lambda)$ with norm 1 and, by arguments involving characteristic functions, it is
one-to-one.

The SVD of the Radon transform in this specific setting appears to have been
first derived by workers in optics and tomography; we now review its properties
drawing material from Born and Wolf (1975, Chapter 9.2.1 and Appendix VII), Marr
(1974) and Deans (1983, Section 7.6). Since the underlying spaces are two
dimensional, we need double indices, specifically $\nu \in N =
\{(l,m) : m = 0,1,2,\ldots;\ l = -m, -m+2, \ldots, m-2, m\}$. In brain space, an orthonormal basis
for $L^2(B,\mu)$ is given by

$$\varphi_\nu(r,\theta) = (m+1)^{1/2} Z_m^{|l|}(r)\,e^{il\theta}, \qquad \nu = (l,m) \in N,\ (r,\theta) \in B, \tag{2.5}$$

where $Z_m^k$ denotes the Zernike polynomial of degree $m$ and order $k$. Zernike
polynomials satisfy the orthogonality relation $\int_0^1 Z_{k+2s}^k(r) Z_{k+2t}^k(r)\,r\,dr =
\tfrac12 (k+2s+1)^{-1}\delta_{st}$, and can be expressed in terms of the more general family of Jacobi
polynomials. They arise naturally from a study of the action of rotation on $L^2(B,\mu)$.

The corresponding orthonormal functions in $L^2(D,\lambda)$ are

$$\psi_\nu(s,\varphi) = U_m(s)\,e^{il\varphi}, \qquad \nu = (l,m) \in N,\ (s,\varphi) \in D, \tag{2.6}$$

where $U_m(\cos\theta) = \sin\{(m+1)\theta\}/\sin\theta$ are the Chebychev polynomials of the second
kind. We have $P\varphi_\nu = b_\nu\psi_\nu$, with the singular values $b_\nu = b_{l,m}$ specified by

$$b_\nu = (m+1)^{-1/2}, \qquad \nu = (l,m) \in N. \tag{2.7}$$

The relatively slow decay of the singular values with degree $m$ (independently of $l$)
suggests that the costs of indirect observation in the PET problem are not inordinately
large.
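As a small numerical illustration of the quantities just introduced, the following Python sketch evaluates the Chebychev polynomials of the second kind by their standard three-term recurrence, compares them with the trigonometric form quoted in the text, and tabulates the singular values $b_\nu = (m+1)^{-1/2}$ together with the factor $b_\nu^{-1}$ by which a naive inversion would amplify noise. The function names are hypothetical and chosen only for this illustration.

```python
import math


def chebyshev_u(m, x):
    """Chebychev polynomial of the second kind U_m(x), via U_{m+1} = 2x U_m - U_{m-1}."""
    u_prev, u_curr = 1.0, 2.0 * x  # U_0(x) and U_1(x)
    if m == 0:
        return u_prev
    for _ in range(m - 1):
        u_prev, u_curr = u_curr, 2.0 * x * u_curr - u_prev
    return u_curr


def singular_value(m):
    """Singular value b_nu = (m + 1)^(-1/2); it depends on the degree m only."""
    return (m + 1) ** -0.5


def noise_amplification(m):
    """Factor 1 / b_nu by which unregularised inversion scales noise at degree m."""
    return 1.0 / singular_value(m)


if __name__ == "__main__":
    theta = 0.7  # arbitrary angle used for the comparison
    for m in (0, 1, 5, 20):
        recurrence = chebyshev_u(m, math.cos(theta))
        closed_form = math.sin((m + 1) * theta) / math.sin(theta)
        print(m, recurrence, closed_form, singular_value(m), noise_amplification(m))
```

The amplification factor grows only like the square root of the degree, which is the sense in which the decay of the singular values is slow.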


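Since we work with real densities $f$, we may identify the complex bases (2.5) and
(2.6) with equivalent real orthonormal bases in a standard fashion. For example
$f = \sum f_\nu\varphi_\nu = \sum \tilde f_\nu\tilde\varphi_\nu$ where

$$\tilde\varphi_{l,m} = \begin{cases} \sqrt2\,\mathrm{Re}(\varphi_{l,m}) & \text{if } l > 0 \\ \varphi_{0,m} & \text{if } l = 0 \\ \sqrt2\,\mathrm{Im}(\varphi_{l,m}) & \text{if } l < 0 \end{cases}$$

and similarly for the real coefficients $\tilde f_{l,m}$. From now on, we suppress the tildes in the
notation and use whichever basis is convenient.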


2.4  Smoothness  classes 

In our subsequent analysis, we place constraints on the unknown density $f$ over
brain space by assuming it lies in a particular class $\mathcal{F}$. For reasons of mathematical
tractability, this class is taken to be a particular ellipsoid $\mathcal{F}$ in the Hilbert space
$H = L^2(B,\mu)$, specified by an array of constants $\{a_\nu\}$ and a threshold $c$:

$$\mathcal{F} = \Big\{ f = \sum f_\nu\varphi_\nu : \sum a_\nu^2 f_\nu^2 \le c \Big\}. \tag{2.8}$$

Ellipsoid conditions can amount to the imposition of smoothness and integrability
requirements. For example in the simple case where $\{\varphi_\nu\}$ is the sequence of
trigonometric polynomials on a bounded interval $[0,2\pi]$ in one dimension and
$a_\nu \sim \nu^{p}$, $\sum a_\nu^2 f_\nu^2 < \infty$ if and only if the periodic function $f$ has $p$ square-integrable
derivatives on the interval.

To describe specific ellipsoids in the PET problem, it is useful to transform the
index set $N$ by the change of variables $j = (m+l)/2$, $k = (m-l)/2$ into the lattice
orthant $N' = \{(j,k) : j \ge 0, k \ge 0\}$. Using the real version of the basis $\{\varphi_\nu\}$, let

$$\mathcal{F}_{p,C} = \Big\{ f \in H : f_{00} = 1,\ \sum (j+1)^p (k+1)^p f_{jk}^2 \le 1 + C^2 \Big\}. \tag{2.9}$$

This set is characterised by the following proposition.

Proposition 2.2: The function $f$ in $H$ lies in some $\mathcal{F}_{p,C}$ if and only if $f$ has $p$ weak
derivatives that are square integrable on $B$ with respect to the modified dominating
measure $d\mu_{p+1}(x) = (p+1)(1-\|x\|^2)^p\,d\mu(x)$.

The condition derived in Proposition 2.2 is of course somewhat weaker than requiring
square-integrability with respect to $\mu$ and the reason for the modification of the
dominating measure is discussed in the proof; a similar technical phenomenon occurs
in Cox (1988). Nevertheless, $\mathcal{F}_{p,C}$ can be regarded as imposing a set of smoothness
and integrability conditions: the higher $p$ is, the smoother are the functions allowed in
$\mathcal{F}_{p,C}$.
How  smooth  are  the  functions  that  we  are  trying  to  reconstruct?  In  X-ray 
transmission  tomography,  there  may  be  discontinuities,  or  at  least  sharp  jumps,  in 
tissue  density  across  the  boundaries  of  various  regions.  As  noted  by  Natterer  (1980, 
1986),  functions  that  are  piecewise  smooth  with  jumps  only  along  smooth  curves  lie  in 
Sobolev spaces corresponding to $p < \tfrac12$ square integrable (fractional) derivatives. In
emission  tomography,  with  its  inherently  lower  resolution,  it  may  perhaps  be 
reasonable  to  postulate  somewhat  smoother  emission  densities  of  the  labelled 
metabolite.  In  any  case,  our  theory  is  presented  for  arbitrary  values  of  the  smoothness 
p  >  0  wherever  possible. 

To ensure that elements of $\mathcal{F}_{p,C}$ are bona fide probability densities, some further
restrictions are needed. To have total mass 1, we require $f_{00} = 1$. By restricting the
constant $C$ that governs the ellipsoid size, we can ensure that $f(x) \ge 0$. This is a
consequence of the following proposition.

Proposition 2.3: Suppose $p \ge 1$ and $f \in \mathcal{F}_{p,C}$. Then

$$\sup_B |f(x) - 1| \le C\,2^{(1-p)/2}. \tag{2.10}$$

Equality is attained in (2.10) if $f$ is a linear function of $x$.

It follows from the proposition that $\mathcal{F}_{p,C}$ will be a class of nonnegative functions on $B$
if and only if $C \le 2^{(p-1)/2}$. Note also that if $g = Pf$, then

$$\sup_y |g(y) - 1| \le \sup_x |f(x) - 1| \tag{2.11}$$

since $P$ is an averaging operator.

3.  Main  conclusions  of  the  paper 
3.1  Arbitrary  estimators 

We use minimax mean integrated square error as our basic approach to the
quantification of the information available in a given sample. The maximum is taken
over a smoothness class $\mathcal{F}_{p,C}$ of unknown functions $f$, and the minimum is then taken
over a class of estimators $\mathcal{T}$, whose specification takes account of whether the
sample is "direct" or "indirect". We define the various classes of estimators as in
Section 2.2 above, and the smoothness classes as in Section 2.4.

Suppose we have a sample from a density $f$ and an estimator $\hat f$ of $f$ based on that
sample. An assessment of the accuracy of $\hat f$ that does not depend on a particular
unknown $f$ can be obtained by merely restricting $f$ to lie in a fixed class, for example
$\mathcal{F}_{p,C}$ for some fixed $p$ and $C$, and finding the maximum mean integrated square error

$$R(\hat f) = \sup_{f \in \mathcal{F}_{p,C}} M(\hat f;f). \tag{3.1}$$

The maximum risk gives an indication of how well any given estimator will perform,
but a large value of $R(\hat f)$ might indicate either that there is not much information in
the sample or that an inefficient estimator is being used. Because we are interested in
the experiment itself rather than any particular estimator, we consider the minimum
value of $R(\hat f)$ over suitable classes of estimators $\hat f$.

Define

$$r_D(n) = \inf_{\hat f \in \mathcal{T}_D(n)} R(\hat f) \tag{3.2}$$

and

$$r_I(n) = \inf_{\hat f \in \mathcal{T}_I(n)} R(\hat f). \tag{3.3}$$

These minimax risks quantify the information about the unknown density inherent in
"direct" and "indirect" data sets of size $n$, in a manner that is independent of the
method of estimation. Comparing their relative values gives an indication of how much
information is lost because data can only be observed indirectly in practice.

We can now state our first main result, which gives exact orders of magnitude for
$r_D(n)$ and $r_I(n)$ for fixed $p$ and $C$. The condition placed on $C$ is precisely that needed
to ensure that all elements of $\mathcal{F}_{p,C}$ are positive probability densities. Here and
subsequently we use the notation $a_n \asymp b_n$ to mean that the sequences $\{a_n\}$ and $\{b_n\}$
satisfy $\inf_n (a_n/b_n) > 0$ and $\sup_n (a_n/b_n) < \infty$.

Theorem 3.1: For fixed $p \ge 1$ and $0 < C \le 2^{(p-1)/2}$, with the definitions (3.1) to (3.3),

$$r_D(n) \asymp (\log n / n)^{p/(p+1)} \tag{3.4}$$

and

$$r_I(n) \asymp (1/n)^{p/(p+2)}. \tag{3.5}$$

The proof of Theorem 3.1 is given in Sections 4 and 5 below. It can be seen
from (3.4) and (3.5) that the effect of the indirect nature of the observations taken in
practice is to reduce somewhat the rate at which the minimax risk converges to zero.
Suppose, for example, $p = 1$, corresponding to $f$ having square-integrable first weak
derivatives. Then (neglecting the logarithmic term) the rate is reduced from $n^{-1/2}$ to
$n^{-1/3}$ by taking indirect rather than direct observations. Note that both these rates are
slower than the $n^{-1}$ rate usually obtained for mean square error in parametric
statistics; this is because, even with the restriction that $f$ lies in $\mathcal{F}_{p,C}$, the space of
possible parameters is infinite dimensional.
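The exponents in (3.4) and (3.5) can be tabulated directly. The short Python sketch below (function names hypothetical, logarithmic factor ignored as in the text) returns the powers of $1/n$ for the direct and indirect problems as exact fractions, and reproduces the $p = 1$ example.

```python
from fractions import Fraction


def direct_rate_exponent(p):
    """Power of 1/n in the direct-data rate (3.4), ignoring the log factor: p / (p + 1)."""
    p = Fraction(p)
    return p / (p + 1)


def indirect_rate_exponent(p):
    """Power of 1/n in the indirect-data rate (3.5): p / (p + 2)."""
    p = Fraction(p)
    return p / (p + 2)


if __name__ == "__main__":
    for p in (1, 2, 4, 8):
        print(p, direct_rate_exponent(p), indirect_rate_exponent(p))
```

Both exponents stay strictly below 1, the parametric value, and the gap between them narrows as $p$ grows.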

Theorem 3.1 also leads to some qualitative conclusions about equivalent sample
sizes. Define the equivalent sample size $m(n)$ to a given indirect sample size $n$ to be
the number of emissions knowledge of whose original positions in the brain would
allow us to estimate $f$ with the same minimax accuracy, so that

$$r_D(m(n)) = r_I(n). \tag{3.6}$$

Some simple algebra from (3.4) and (3.5) yields the order of magnitude of the
equivalent sample size as

$$m(n) \asymp n^{(p+1)/(p+2)} \log n. \tag{3.7}$$

Perhaps not surprisingly, the order of magnitude of the equivalent sample size
depends on the smoothness assumptions made on the density $f$. The smoother $f$ is
assumed to be, the larger will be the index $p$. Hence for very smooth densities the
power in (3.7) will be close to 1 and little will be lost as a result of the indirect nature
of the observation process. However, in reality, we ought not to assume that the true
emission density necessarily varies very smoothly, since tissue boundaries and/or
localised areas of high metabolic activity may lead to discontinuities, certainly in high
derivatives of $f$ and possibly in $f$ itself.
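The algebra behind (3.7) can be checked numerically. Equating the two orders of magnitude with every constant set to 1 (an assumption made purely for illustration) gives $m/\log m = n^{(p+1)/(p+2)}$, and since $\log m$ is then of order $\log n$, $m$ is of order $n^{(p+1)/(p+2)}\log n$. The Python sketch below solves that equation by bisection; the function names are hypothetical.

```python
import math


def solve_m_over_log_m(t, tol=1e-10):
    """Return m > e with m / log(m) = t, by bisection (m / log m is increasing for m > e)."""
    if t <= math.e:
        raise ValueError("t must exceed e for a solution with m > e")
    lo, hi = math.e, 2.0 * math.e
    while hi / math.log(hi) < t:  # expand the bracket until it contains the root
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if mid / math.log(mid) < t:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def equivalent_sample_size_order(n, p):
    """Solve (log m / m)^(p/(p+1)) = n^(-p/(p+2)) for m, with all constants set to 1."""
    return solve_m_over_log_m(n ** ((p + 1) / (p + 2)))


if __name__ == "__main__":
    for p in (1.0, 2.0, 4.0):
        n = 1e6
        m = equivalent_sample_size_order(n, p)
        # The last column is m divided by the order of magnitude in (3.7).
        print(p, m, m / (n ** ((p + 1) / (p + 2)) * math.log(n)))
```

Because the constants are ignored, the output indicates orders of magnitude only and should not be read as an equivalent sample size in any practical sense.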

3.2  Linear  estimators 

More precise numerical conclusions cannot be drawn directly from
(3.7), because Theorem 3.1 only gives orders of magnitude for the relevant risks. We
are able, however, to give explicit approximate numerical equivalent sample sizes for
minimax risks calculated restricting attention to linear estimators and using as a
measure of error the surrogate mean integrated square error $M^*$ defined in (2.4). By
analogy to (3.2) and (3.3) define surrogate linear minimax risks $r^*_{LD}(n)$ and $r^*_{LI}(n)$ by

$$r^*_{LD}(n) = \inf_{\hat f \in \mathcal{T}_{LD}(n)}\ \sup_{f \in \mathcal{F}_{p,C}} M^*(\hat f;f) \tag{3.8}$$

and

$$r^*_{LI}(n) = \inf_{\hat f \in \mathcal{T}_{LI}(n)}\ \sup_{f \in \mathcal{F}_{p,C}} M^*(\hat f;f). \tag{3.9}$$

The second main result gives leading terms of asymptotic expansions for $r^*_{LD}$ and
$r^*_{LI}$. The leading orders of magnitude are exactly the same as those given for the
corresponding quantities in Theorem 3.1, and so the restriction to linear estimators
does not affect the rates of convergence available. All the constants $c_j$ depend only on
the smoothness $p$ and are collected in Table 1. One of our reasons for introducing
surrogate mean integrated square error is that we have been able to derive these more
precise expressions, and hence obtain numerical results. The other reason is that the
result of Theorem 3.2 is a key step in the proof of Theorem 3.1.


Theorem 3.2: For $x > 1$, let $\alpha(x)$ denote the solution to $\alpha\log\alpha = x$, and set

$$\eta_n^{p+1} = c_1^{-1}\,\alpha(c_2 n C^2). \tag{3.10}$$

Then, provided $0 < C \le 2^{(p-1)/2}$,

$$r^*_{LD}(n) = c_3 n^{-1}\eta_n(\log\eta_n + c_4) + O(n^{-1}\eta_n^{1/2}) \tag{3.11}$$

$$\phantom{r^*_{LD}(n)} = c_5\,C^{2/(p+1)}(\log n/n)^{p/(p+1)}\{1 + o(1)\} \tag{3.12}$$

and

$$r^*_{LI}(n) = c_6\,C^{4/(p+2)}\,n^{-p/(p+2)} + O(n^{-(p+1)/(p+2)}\log n). \tag{3.13}$$

The form (3.12) for $r^*_{LD}$ is more transparent, but the error term can be shown to have
the same polynomial order as the leading term; the error term in (3.11) is of lower
order and so we use (3.11) in numerical computations. Of course, $\alpha(x)$ can be found
numerically when required and is asymptotic to $x/\log x$ for large $x$.
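One way of finding $\alpha(x)$ numerically is Newton's method applied to $h(a) = a\log a - x$, which is convex and increasing for $a > 1$. The Python sketch below is an illustrative choice of solver rather than the method used for the computations reported here; the function name and tolerances are hypothetical.

```python
import math


def alpha(x, tol=1e-12, max_iter=100):
    """Solve a * log(a) = x for a > 1 by Newton's method (requires x > 0)."""
    if x <= 0:
        raise ValueError("x must be positive")
    # Starting value: the large-x asymptote x / log x when it is usable, else 2.
    a = max(2.0, x / math.log(x)) if x > math.e else 2.0
    for _ in range(max_iter):
        step = (a * math.log(a) - x) / (math.log(a) + 1.0)  # h(a) / h'(a)
        a -= step
        if abs(step) <= tol * a:
            break
    return a


if __name__ == "__main__":
    for x in (2.0, 10.0, 1e3, 1e6):
        a = alpha(x)
        # Second column: the solution; third: ratio to the asymptote x / log x.
        print(x, a, a / (x / math.log(x)))
```

The ratio in the last column approaches 1 only slowly, which is consistent with the remark that (3.11) rather than (3.12) is the form to use in numerical work.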


For any particular indirect sample size $n$, the approximate equivalent sample size
$m^*(n)$ can be found: equate the expressions (3.11) for $r^*_{LD}(m^*)$ and (3.13) for $r^*_{LI}(n)$,
neglect the lower order terms, and solve numerically for $m^*$. For definiteness we take
$C^2 = 2^{p-1}$, the largest value for which all $f$ in $\mathcal{F}_{p,C}$ are non-negative probability
densities, so long as $p \ge 1$ (Proposition 2.3). Some representative cases are given in
Table 2. As expected, the equivalent sample size increases as the assumed amount of
smoothness rises. If technology allows an order of magnitude increase in the amount
of data collected, then the equivalent direct sample sizes increase by a factor of
between 5 and 8, this factor itself increasing with assumed smoothness.

For the quantity $m^*(n)$ the asymptotic constant of proportionality in the
expression corresponding to (3.7) can be found. A simple calculation uses relations
(3.12) and (3.13), with the error terms ignored, to conclude that

$$m^*(n) = (p+1)(p+2)^{-1}(c_5/c_6)^{(p+1)/p}\,C^{-2/(p+2)}\,n^{(p+1)/(p+2)}\log n\,\{1 + o(1)\}.$$


In  summary,  our  results  confirm  intuition  that  for  the  PET  problem,  the  amount 
of  information  available  is  still  substantial,  but  it  is  by  no  means  as  great  as  if  a 
sample  of  the  same  number  of  direct  observations  were  available. 
4.  Convergence rates for linear estimators

The main aim of this section is to prove Theorem 3.2, which gives the asymptotic
behaviour of the surrogate risks (2.4) for linear estimators. It is a consequence of
Propositions 2.1 and 2.3 that, provided $C < 2^{(p-1)/2}$, the ratio of exact to surrogate mean
integrated square error for linear estimates will be bounded above and below away
from 0 uniformly over $\mathcal{F}_{p,C}$. Since $\mathcal{T}_{LD}(n)$ and $\mathcal{T}_{LI}(n)$ are subclasses of $\mathcal{T}_D(n)$ and
$\mathcal{T}_I(n)$ respectively, it then follows that the orders of magnitude of $r_D(n)$ and $r_I(n)$ are
bounded above by those obtained in Theorem 3.2 for surrogate linear minimax risks.
Once Theorem 3.2 has been proved, the proof of Theorem 3.1 will be completed in
Section 5 by showing that these are also lower bounds.

4.1  Structure  of  the  linear  minimax  estimator 

We consider the indirect case first; the argument we shall use will apply to the
direct case also. We start by defining some notation. Suppose that $\hat f$ is in $\mathcal{T}_{LI}(n)$.
For $\nu$ and $\kappa$ in $N$ define $w_{\nu\kappa} = \iint w(x,y)\varphi_\nu(x)\psi_\kappa(y)\,d\mu(x)\,d\lambda(y)$; because of the
condition $\iint w^2\,d\mu\,d\lambda < \infty$, standard functional analysis gives that, in the $L^2$ sense,

$$w(x,y) = \sum_\nu\sum_\kappa w_{\nu\kappa}\,\varphi_\nu(x)\psi_\kappa(y). \tag{4.1}$$

As in Sections 2.3 and 2.4, we expand $f$ as $\sum f_\nu\varphi_\nu$. We write $W$ for the infinite matrix
$(w_{\nu\kappa})$ and $\mathbf{f}$ for the vector $(f_\nu)$. The index set of all vectors and matrices will be the
set $N$; the subscript $(0,0)$ will be written as 0 for simplicity. Since $\int f\,d\mu = 1$ the
coefficient $f_0 = 1$. Write $B = \operatorname{diag}(b_\nu)$, the singular values of the operator $P$. Let $e_\nu$
be the vector $(\delta_{\nu\kappa} : \kappa \in N)$. The first lemma gives a matrix form for the surrogate
mean integrated square error of the linear estimator $\hat f$.

Lemma 4.1 With the above definitions,

$$M^*(\hat f;f) = n^{-1}\operatorname{tr} W(I - e_0e_0^T)W^T + \mathbf{f}^T(I-WB)^T(I-WB)\mathbf{f}. \tag{4.2}$$

Proof Write $\hat f = \sum \hat f_\nu\varphi_\nu$. From (4.1) it follows that $\hat{\mathbf{f}} = W\eta$ where
$\eta_\nu = n^{-1}\sum_i \psi_\nu(Y_i)$. Each $Y_i$ has density $g = \sum g_\nu\psi_\nu$ where $\mathbf{g} = B\mathbf{f}$, and for each $\nu$
$E_f\eta_\nu = \int \psi_\nu g\,d\lambda = g_\nu$, so that $E_f\eta = B\mathbf{f}$. Hence $E_f\hat{\mathbf{f}} = WB\mathbf{f}$, and the integrated square
bias

$$\int (E_f\hat f - f)^2\,d\mu = \|E_f\hat{\mathbf{f}} - \mathbf{f}\|^2 = \|WB\mathbf{f} - \mathbf{f}\|^2 = \|(I - WB)\mathbf{f}\|^2. \tag{4.3}$$

If $f$ is the uniform density, then $\mathbf{f} = e_0$ and so, writing $E_1$ for an expectation relative
to the uniform density $f$, $E_1\eta = Be_0 = e_0$ since $b_0 = 1$.

By the orthonormality of the $\psi_\nu$, $E_1\{\psi_\nu(Y_i)\psi_\kappa(Y_i)\} = \delta_{\nu\kappa}$, and so $\eta$ has
covariance matrix $n^{-1}(I - e_0e_0^T)$ under the uniform distribution. Thus the surrogate
variance term

$$\int \operatorname{var}_1\hat f\,d\mu = E_1\|\hat{\mathbf{f}} - E_1\hat{\mathbf{f}}\|^2 = E_1\|W(\eta - E_1\eta)\|^2 = n^{-1}\operatorname{tr} W(I - e_0e_0^T)W^T \tag{4.4}$$

by a standard multivariate calculation. To complete the proof, substitute (4.3) and
(4.4) into the definition (2.4) of surrogate mean integrated square error.

□

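Our second lemma provides an expression for the surrogate linear minimax risk
and gives the general form of the minimax estimator. The smoothness class $\mathcal{F}$ is
defined as in (2.8) and (2.9) to be $\mathcal{F} = \{f : f_0 = 1,\ \mathbf{f}^TA\mathbf{f} \le 1 + C^2\}$ where we write
$A = \operatorname{diag}(a_\nu^2)$, and assume that $a_0 = 1$, $\sup a_\nu^2 = \infty$ and that every $f$ in $\mathcal{F}$ is
non-negative.

Lemma 4.2

$$\inf_{\hat f \in \mathcal{T}_{LI}(n)}\ \sup_{f \in \mathcal{F}} M^*(\hat f;f) = n^{-1}\sum_{\nu \ne 0} b_\nu^{-2}(1 - a_\nu\gamma^{1/2})_+ \tag{4.5}$$

where $\gamma$ is chosen to ensure that

$$n^{-1}\sum_{\nu \ne 0} b_\nu^{-2}a_\nu^2(\gamma^{-1/2}a_\nu^{-1} - 1)_+ = C^2. \tag{4.6}$$

The minimax estimator is given by setting, in (4.1),

$$w_{\nu\kappa} = \delta_{\nu\kappa} \text{ for } \nu = 0 \quad\text{and}\quad w_{\nu\kappa} = \delta_{\nu\kappa}\,b_\nu^{-1}(1 - \gamma^{1/2}a_\nu)_+ \text{ otherwise.} \tag{4.7}$$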

The form of the minimax estimator is worth noting, since it corresponds to a
diagonal matrix of weights and hence is an estimator of the form
$\hat f(x) = n^{-1}\sum_i\sum_\nu w_{\nu\nu}\,\psi_\nu(Y_i)\varphi_\nu(x)$. Although the derivation of the estimator has been
performed for theoretical reasons, some examples of the use of estimators of this kind
are given by Jones & Silverman (1989). Similar results to Lemma 4.2 exist for
standard regression (for example Pinsker, 1980, Speckman, 1985) and for other
nonparametric problems (for example Buckley et al., 1988). Our proof is an extension
of that of Speckman (1985, pp. 981-982).
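To make the structure of Lemma 4.2 concrete, the following Python sketch solves the constraint (4.6) for $\gamma$ by bisection and then evaluates the risk (4.5). It assumes the PET-specific values $a_\nu^2 = (j+1)^p(k+1)^p$ and $b_\nu^{-2} = m + 1 = j + k + 1$ on the lattice $(j,k)$, with $b_\nu = 1$ in the direct case; the sample values of $n$, $p$ and $C$ and all function names are hypothetical.

```python
import math


def lattice(eta):
    """Index pairs (j, k), j, k >= 0, with 1 < (j + 1)(k + 1) <= eta."""
    top = int(math.floor(eta))
    return [(j, k)
            for j in range(top)
            for k in range(top // (j + 1))
            if (j + 1) * (k + 1) > 1]


def terms(gamma, p, indirect):
    """Yield (b^-2, a, shrink) per index, with shrink = (1 - gamma^(1/2) a)_+."""
    root = math.sqrt(gamma)
    # Only indices with (j + 1)(k + 1) < gamma^(-1/p) can have a nonzero shrink factor.
    for j, k in lattice(gamma ** (-1.0 / p)):
        a = ((j + 1) * (k + 1)) ** (p / 2.0)
        b_inv_sq = float(j + k + 1) if indirect else 1.0
        yield b_inv_sq, a, max(1.0 - root * a, 0.0)


def constraint(gamma, n, p, indirect=True):
    """Left-hand side of (4.6): n^-1 * sum of b^-2 a (gamma^(-1/2) - a)_+."""
    root = math.sqrt(gamma)
    return sum(b2 * a * s / root for b2, a, s in terms(gamma, p, indirect)) / n


def solve_gamma(n, p, C, indirect=True):
    """Bisection (on a log scale) for the gamma in (0, 1) satisfying (4.6)."""
    lo = hi = 1.0
    while constraint(lo, n, p, indirect) < C * C:  # the constraint decreases in gamma
        lo *= 0.5
    for _ in range(100):
        mid = math.sqrt(lo * hi)
        if constraint(mid, n, p, indirect) < C * C:
            hi = mid
        else:
            lo = mid
    return math.sqrt(lo * hi)


def minimax_risk(gamma, n, p, indirect=True):
    """Right-hand side of (4.5)."""
    return sum(b2 * s for b2, a, s in terms(gamma, p, indirect)) / n


if __name__ == "__main__":
    n, p, C = 1000, 2.0, 1.0
    for indirect in (False, True):
        g = solve_gamma(n, p, C, indirect)
        print("indirect" if indirect else "direct", g, minimax_risk(g, n, p, indirect))
```

The diagonal weights in (4.7) are the shrink factors divided by $b_\nu$, so components with $\gamma^{1/2}a_\nu \ge 1$ are discarded entirely and the rest are tapered.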

Proof of Lemma 4.2 The condition $\int w(x,y)\,d\mu(x) = 1$ for all $y$ implies that $w_{00} = 1$
and $w_{0\nu} = 0$ for $\nu \ne 0$. Let $\mathcal{W}$ be the set of matrices $W$ satisfying this condition and
for which $\sum\sum w_{\nu\kappa}^2 < \infty$; the matrices $W$ in $\mathcal{W}$ correspond precisely to the estimators $\hat f$
in $\mathcal{T}_{LI}(n)$. We use Lemma 4.1 and find the minimax value of the expression (4.2)
over $W$ in $\mathcal{W}$ and $f$ in $\mathcal{F}$. Let

$$J(W) = \sup_{f \in \mathcal{F}} \big\{ \|(I - WB)\mathbf{f}\|^2 + n^{-1}\operatorname{tr} W(I - e_0e_0^T)W^T \big\}. \tag{4.8}$$

Let $W^0$ be the matrix $\operatorname{diag}(w_{\nu\nu})$; we show that $J(W) \ge J(W^0)$ and hence that we may
restrict attention to diagonal matrices in $\mathcal{W}$.
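For fixed $\kappa$ in $N$, $\kappa \ne 0$, let $\mathcal{F}_\kappa$ be the set $\{f = 1 + f_\kappa\varphi_\kappa,\ a_\kappa^2 f_\kappa^2 \le C^2\}$. Then

$$\sup_{\mathcal{F}_\kappa} \|(I - WB)\mathbf{f}\|^2 = \sup_{\mathcal{F}_\kappa} \sum_\nu (w_{\nu 0} + w_{\nu\kappa}b_\kappa f_\kappa - f_\nu)^2 \ge \sup_{\mathcal{F}_\kappa} (w_{\kappa 0} + w_{\kappa\kappa}b_\kappa f_\kappa - f_\kappa)^2 \ge (1 - w_{\kappa\kappa}b_\kappa)^2 C^2/a_\kappa^2, \tag{4.9}$$

by picking out the $\kappa$ term from the summation and performing some elementary
algebra. Again by restricting the sum, we have

$$\operatorname{tr} W(I - e_0e_0^T)W^T = \sum_{\nu \ne 0}\sum_{\kappa \ne 0} w_{\nu\kappa}^2 \ge \sum_{\kappa \ne 0} w_{\kappa\kappa}^2. \tag{4.10}$$

Restricting the supremum to $f$ in $\bigcup_\kappa \mathcal{F}_\kappa$, and substituting (4.9) and (4.10), we obtain

$$J(W) \ge \sup_{\kappa \ne 0} (1 - w_{\kappa\kappa}b_\kappa)^2 C^2/a_\kappa^2 + n^{-1}\sum_{\kappa \ne 0} w_{\kappa\kappa}^2 = J(W^0) \tag{4.11}$$

by checking that every inequality in our argument is an exact equality when $W$ is
diagonal.

Let $\gamma = \sup_{\kappa \ne 0} (1 - w_{\kappa\kappa}b_\kappa)^2 a_\kappa^{-2}$. Now reason from (4.11) as in Speckman (1985)
to obtain (4.7); then substitute into the expression for $J(W^0)$ in (4.11) and minimize
over $\gamma$ to complete the proof.

□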

To obtain corresponding results for the direct case, set the operator $P$ to the
identity in the whole of the preceding argument. The minimax surrogate risk $r^*_{LD}(n)$ is
given by (4.5) and (4.6) with all $b_\nu$ set to 1. The minimax estimator
$n^{-1}\sum_{i,\nu} w_{\nu\nu}\,\varphi_\nu(X_i)\varphi_\nu(x)$ is a probability density estimate of tapered orthogonal series
form as introduced and studied by Watson (1969).

4.2  Integral  approximation  of  the  minimax  risks 

In this subsection we explicitly approximate the expression (4.5), and the
corresponding expression for the direct case, to complete the proof of Theorem 3.2.
We set $\mathcal{F} = \mathcal{F}_{p,C}$ as in (2.9) so that $a_{jk}^2 = (j+1)^p(k+1)^p$. The key to our treatment is the
following approximation lemma, obtained by approximating sums by integrals.

Lemma 4.3 For any $\eta$, let $\sum_{(\eta)}$ denote a sum over $\{(j,k) : 1 < (j+1)(k+1) \le \eta\}$. For
fixed $r \ge 0$, as $\eta \to \infty$,

$$\sum\nolimits_{(\eta)} (j+1)^r(k+1)^r = (r+1)^{-1}\eta^{r+1}\{\log\eta + 2\gamma_E - (r+1)^{-1}\} + O(\eta^{r+1/2}) \tag{4.12}$$

where $\gamma_E$ is Euler's constant, and

$$\sum\nolimits_{(\eta)} (j+k+1)(j+1)^r(k+1)^r = \tfrac13\pi^2(r+2)^{-1}\eta^{r+2} + O(\eta^{r+1}\log\eta). \tag{4.13}$$
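Proof For the proof, we transform the sums by replacing $j+1$ by $j$ and $k+1$ by $k$, and
denote by $\sum'_{(\eta)}$ the sum over the transformed range $\{(j,k) : j \ge 1,\ k \ge 1 \text{ and } 1 < jk \le \eta\}$.
By symmetry in $(j,k)$, the sum in (4.12) satisfies

$$S = \sum\nolimits'_{(\eta)} j^rk^r = 2\sum_{k=1}^{[\eta^{1/2}]}\sum_{j=1}^{[\eta/k]} j^rk^r - \sum_{j=1}^{[\eta^{1/2}]}\sum_{k=1}^{[\eta^{1/2}]} j^rk^r - 1.$$

From the relation $\sum_{j=1}^{[x]} j^r = (r+1)^{-1}x^{r+1} + O(x^r)$, we obtain

$$S = 2(r+1)^{-1}\sum_{k=1}^{[\eta^{1/2}]} k^r\{(\eta k^{-1})^{r+1} + O(\eta^rk^{-r})\} - \{(r+1)^{-1}\eta^{(r+1)/2} + O(\eta^{r/2})\}^2 - 1$$

$$\phantom{S} = 2(r+1)^{-1}\eta^{r+1}\sum_{k=1}^{[\eta^{1/2}]} k^{-1} - (r+1)^{-2}\eta^{r+1} + O(\eta^{r+1/2})$$

$$\phantom{S} = 2(r+1)^{-1}\eta^{r+1}\{\tfrac12\log\eta + \gamma_E + O(\eta^{-1/2})\} - (r+1)^{-2}\eta^{r+1} + O(\eta^{r+1/2}), \tag{4.14}$$

which yields the result of (4.12).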

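To deal with (4.13), we need an integral approximation, valid for $s \ge 0$ and $x \ge 1$,

$$\sum_{j=1}^{[x]} j^s = (s+1)^{-1}x^{s+1} + O(x^s), \tag{4.15}$$

which follows from the bounds $(s+1)^{-1}[x]^{s+1} \le \int_0^{[x]} t^s\,dt \le \sum_{j=1}^{[x]} j^s \le \int_1^{[x]+1} t^s\,dt
\le (s+1)^{-1}([x]+1)^{s+1}$. Assuming that $\eta$ is an integer, it then follows that

$$1 + \sum\nolimits'_{(\eta)} j^{r+1}k^r = \sum_{k=1}^{\eta} k^r\sum_{j=1}^{[\eta/k]} j^{r+1} = \sum_{k=1}^{\eta} k^r(r+2)^{-1}(\eta k^{-1})^{r+2} + \sum_{k=1}^{\eta} k^r\,O\{(\eta k^{-1})^{r+1}\}$$

$$= (r+2)^{-1}\eta^{r+2}\sum_{k=1}^{\eta} k^{-2} + O\Big(\eta^{r+1}\sum_{k=1}^{\eta} k^{-1}\Big)$$

$$= \tfrac16\pi^2(r+2)^{-1}\eta^{r+2} + O(\eta^{r+1}\log\eta). \tag{4.16}$$

To complete the proof of (4.13), transform the sum to
$\sum'_{(\eta)}(j^{r+1}k^r + j^rk^{r+1} - j^rk^r)$. Then substitute (4.16) for each of the first two terms,
and use (4.15) to absorb the third term into the error. □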
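The approximation (4.12) is easy to check by brute force. The Python sketch below sums $\{(j+1)(k+1)\}^r$ over the lattice points with $1 < (j+1)(k+1) \le \eta$ and compares the result with the leading term $(r+1)^{-1}\eta^{r+1}\{\log\eta + 2\gamma_E - (r+1)^{-1}\}$. The function names and the test value of $\eta$ are hypothetical choices for illustration.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant


def lattice_sum(eta, r):
    """Sum of ((j + 1)(k + 1))^r over j, k >= 0 with 1 < (j + 1)(k + 1) <= eta."""
    top = int(math.floor(eta))
    total = 0.0
    for j1 in range(1, top + 1):              # j1 stands for j + 1
        for k1 in range(1, top // j1 + 1):    # k1 stands for k + 1, so j1 * k1 <= eta
            if j1 * k1 > 1:
                total += float(j1 * k1) ** r
    return total


def leading_term(eta, r):
    """Leading term on the right-hand side of (4.12)."""
    return eta ** (r + 1) / (r + 1) * (math.log(eta) + 2.0 * EULER_GAMMA - 1.0 / (r + 1))


if __name__ == "__main__":
    eta = 3000.0
    for r in (0.0, 1.0, 2.5):
        exact = lattice_sum(eta, r)
        approx = leading_term(eta, r)
        print(r, exact, approx, abs(exact - approx) / approx)
```

The relative discrepancy printed in the last column should be small and of the order suggested by the $O(\eta^{r+1/2})$ error term.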

Completion of proof of Theorem 3.2. We will have $\gamma a_\nu^2 < 1$ if and only if
$(j+1)(k+1) < \gamma^{-1/p}$ and so the $(\ )_+$ in (4.6) and (4.7) may be replaced by $(\ )$ if the
sums over all $\nu$ are replaced by $\sum_{(\eta)}$ with $\eta = \gamma^{-1/p}$. The constants $c_j$ will be defined
as in Table 3.2.

In the direct case, we replace $\mathcal{T}_{LI}$ in (4.5) by $\mathcal{T}_{LD}$ and set all $b_\nu$ to 1. Applying
(4.12), equation (4.6) becomes

$$C^2 = n^{-1}\sum\nolimits_{(\eta)} (\gamma^{-1/2}a_\nu - a_\nu^2)$$

$$\phantom{C^2} = n^{-1}\sum\nolimits_{(\eta)} \{\gamma^{-1/2}(j+1)^{p/2}(k+1)^{p/2} - (j+1)^p(k+1)^p\}$$

$$\phantom{C^2} = n^{-1}\eta^{p+1}(c_7\log\eta + c_8) + n^{-1}O(\eta^{p+1/2}). \tag{4.17}$$

The substitution $y = c_1\eta^{p+1}$ reduces equation (4.17) with the error term omitted to
the form $y\log y = c_2nC^2$; it follows that $\eta_n$ as defined in (3.10) is the solution for $\eta$ of
this equation. Apply similar manipulations to (4.5) to obtain

$$r^*_{LD}(n) = n^{-1}\sum\nolimits_{(\eta)} (1 - \gamma^{1/2}a_\nu) = c_3n^{-1}\eta_n\{\log\eta_n + c_4\} + n^{-1}O(\eta_n^{1/2}),$$

completing the proof of (3.11). To prove (3.12), substitute the definition of $\eta_n$ into
(3.11), and use the fact that $\alpha(x) = (x/\log x)\{1 + o(1)\}$ for large $x$.

For the indirect case, we use the values (2.7) for the $b_\nu$. Equation (4.6) then
becomes

$$C^2 = n^{-1}\sum\nolimits_{(\eta)} (j+k+1)\{\gamma^{-1/2}(j+1)^{p/2}(k+1)^{p/2} - (j+1)^p(k+1)^p\}$$

$$\phantom{C^2} = c_9n^{-1}\eta^{p+2} + n^{-1}O(\eta^{p+1}\log\eta), \tag{4.18}$$

where $c_9 = (\pi^2/3)\,p\,(p+2)^{-1}(p+4)^{-1}$. Set $\eta_n = (nC^2/c_9)^{1/(p+2)}$, the solution to (4.18)
with the error omitted. Then the solution to (4.18) with the error included satisfies
$\eta = \eta_n + O(\log\eta_n)$. Substitute back into (4.5), apply Lemma 4.3, and perform some
elementary algebra to obtain (3.13), and hence to complete the proof of Theorem 3.2.

□

To summarise this section, we have shown that, for linear estimators, the indirect
nature of the PET observations reduces the minimax rate of consistency in mean
integrated square error from $O\{(n/\log n)^{-p/(p+1)}\}$ to $O(n^{-p/(p+2)})$. It will be shown
in the next section that these rates of consistency are both best possible even if we
allow the class of estimators to be extended to cover all linear and non-linear
estimators.


5.  Lower  bounds 

In  this  section  we  establish  lower  bounds  on  the  rates  of  consistency  of  arbitrary 
estimators  based  on  direct  and  indirect  observations.  These  lower  bounds  show  that 
the  minimax  rates  obtained  for  linear  estimators  in  Section  4  cannot  be  improved  by 
extending  the  class  of  estimators  considered.  As  noted  at  the  beginning  of  Section  4, 
this  will  complete  the  proof  of  Theorem  3.1. 
5.1  Moduli of continuity and a general lower bound for global norms

Our approach is based on Fano's lemma of information theory, as developed by
Ibragimov and Hasminskii (e.g. 1981) and Birgé (1983), although a slight extension of
Birgé's formulation is needed for the indirect observation case. Although we continue
to focus on the PET example, it will be seen that the methodology applies quite
generally to estimation with global norms in linear inverse problems of both density
and regression estimation type.

The convergence rate in the indirect problem clearly depends on the operator $P^{-1}$
mapping the observable density $g$ to the target density $f$. One convenient approach to
computing convergence rates has two parts: (i) compute a "modulus of continuity"
$r(\varepsilon)$ for $P^{-1}$, and (ii) argue that a lower bound to the minimax convergence rate is
given by (essentially) $r(n^{-1/2})$. This approach separates stochastics and analysis: step
(ii) uses the information theory lemma to bound the estimation error by $r(n^{-1/2})$ while
step (i) is a concrete optimisation problem for the particular operator in question. This
viewpoint was taken recently by Donoho and Liu (1989) in their study of estimation of
linear functionals. We begin with step (ii), which computes a modulus $\sigma(\delta)$ which is
more convenient for the problems at hand. We return to step (i) in Section 5.2 below.

Suppose, in general, there are available $n$ i.i.d. observations $Y^{(n)} = (Y_1,\ldots,Y_n)$
from a density $g(y)\,d\lambda(y)$, $y \in D$, and that we wish to estimate $f = P^{-1}g$. We assume
that $f \in \mathcal{F} \subset H$, and that $\mathcal{F}$ is a translate $f^0 + H_0$ of a set $H_0$ that is balanced about the
origin ($h \in H_0 \Rightarrow -h \in H_0$). Let $M$ be a finite-dimensional subspace of $H$: we write $|M|$
for the dimension of $M$ and $B_M(\delta)$ for the open ball of radius $\delta$ about 0 in $M$. The
norm of the restriction of $P$ to $M$ is defined by $\|P\|_M = \sup\{\|Ph\|/\|h\| : h \in M\}$.
Finally, let $\mathcal{M}_\delta = \{M : B_M(\delta) \subset H_0\}$. The modulus $\sigma(\delta)$ may now be defined as

$$\sigma(\delta) = \delta\,\inf\{\|P\|_M/|M|^{1/2} : M \in \mathcal{M}_\delta\}. \tag{5.1}$$

Loosely speaking, $\sigma(\delta)$ measures the decay of the singular values of $P$ relative to the
parameter space $H_0$ at resolution $\delta$. Since $\sigma$ is strictly increasing, a left-continuous
inverse $r(\varepsilon) = \sigma^{-1}(\varepsilon)$ can be defined.

Let $\hat f \in \mathcal{T}_I(n)$ be an arbitrary estimator based on $Y^{(n)}$. The significance of the
modulus functional is that an (often sharp) lower bound for the rate of convergence of
$\|\hat f - f\|$ over $\mathcal{F}$ is given by $r(n^{-1/2})$. For the proof we need an additional assumption
bounding the Kullback-Leibler information divergence $K(g_\alpha,g_\beta) = \int \log(g_\alpha/g_\beta)\,g_\alpha\,d\lambda$
over $\mathcal{G} = P\mathcal{F}$:

$$\text{For some } A < \infty,\quad K(g_\alpha,g_\beta) \le A\|g_\alpha - g_\beta\|^2 \text{ for all } g_\alpha, g_\beta \in \mathcal{G}. \tag{5.2}$$

This condition will be satisfied provided the densities $g$ in $\mathcal{G}$ are uniformly bounded
above and below away from zero. In the context of Theorem 3.1, this is a
consequence of (2.11) and (2.10).


Proposition 5.1 If condition (5.2) holds, there exist constants $d_1, d_2$ such that

$$\inf_{\hat f \in \mathcal{T}(n)} \sup_{f \in \mathcal{F}} E_f \|\hat f - f\|_H^2 \ge d_1\, r^2(d_2 n^{-1/2}). \qquad (5.3)$$

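Proof Choose a subset $\mathcal{F}^0 = \{f_1, \ldots, f_r\} \subset \mathcal{F}$ that is $2\delta$-distinguishable: namely $\|f_\alpha - f_\beta\| \ge 2\delta$ if $\alpha \ne \beta$. Set $g_\alpha = Pf_\alpha$ and write $K_n(g_\alpha, g_\beta) = n \int \log(g_\alpha/g_\beta)\, g_\alpha\, d\lambda$, the Kullback-Leibler discrepancy based on a sample of size $n$.

Consider the discrimination problem of choosing among the $r$ hypotheses $\mathcal{F}^0$. Given an estimator $\hat f \in \mathcal{T}(n)$, define a discrimination rule $\phi(Y^{(n)})$ taking values in $\mathcal{F}^0$ that picks the closest element in $\mathcal{F}^0$ to $\hat f$. Then, by elementary probability and analysis,

$$\sup_{f \in \mathcal{F}} E_f\|\hat f - f\|^2 \ge \sup_{f \in \mathcal{F}^0} E_f\|\hat f - f\|^2 \ge \delta^2 \sup_{f \in \mathcal{F}^0} P_f(\|\hat f - f\| \ge \delta)$$

$$\ge \delta^2 r^{-1} \sum_{\alpha=1}^{r} P_{f_\alpha}(\|\hat f - f_\alpha\| \ge \delta) \ge \delta^2 r^{-1} \sum_{\alpha=1}^{r} P_{f_\alpha}\{\phi(Y^{(n)}) \ne f_\alpha\}, \qquad (5.4)$$

since $\phi(Y^{(n)}) \ne f_\alpha$ implies that $\|\hat f - f_\alpha\| \ge \delta$, because of the $2\delta$-distinguishability.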

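By Birgé's version (1983, p.196) of Fano's lemma, the average error rate in the discrimination problem can be bounded below as follows:

$$r^{-1} \sum_{\alpha=1}^{r} P_{f_\alpha}\{\phi(Y^{(n)}) \ne f_\alpha\} \ge 1 - \Big\{\sup_{1 \le \alpha,\beta \le r} K_n(g_\alpha, g_\beta) + \log 2\Big\}\Big/\log(r-1). \qquad (5.5)$$

Combining (5.4) and (5.5), and substituting (5.2), we obtain the lower bound

$$\delta^{-2} \sup_{f \in \mathcal{F}} E_f\|\hat f - f\|^2 \ge 1 - \Big\{nA \sup_{1 \le \alpha,\beta \le r} \|Pf_\alpha - Pf_\beta\|_K^2 + \log 2\Big\}\Big/\log(r-1). \qquad (5.6)$$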

To make use of this lower bound, we use the metric dimension properties of $\mathcal{F}$ and the operator $P$ to construct a suitable set $\mathcal{F}^0$ for which $r$ is large and $\sup\|Pf_\alpha - Pf_\beta\|_K^2$ is small. From the definition (5.1) of the modulus $\sigma$, choose a subspace $M$ of $H$ for which $B_M(4\delta) \subset H_0$ and $4\delta\|P\|_M/|M|^{1/2} \le 2\sigma(4\delta)$. A useful lemma of approximation theory (e.g. Lorentz, 1966, p.905) asserts that a $k$-dimensional ball of radius $R$ contains an $R/2$-distinguishable subset of cardinality at least $2^k$. Setting $r = 2^{|M|}$, use this lemma to choose $h_1, \ldots, h_r \in B_M(4\delta)$ such that $\|h_\alpha - h_\beta\| \ge 2\delta$, and define the $2\delta$-distinguishable set $\mathcal{F}^0$ by $f_\alpha = f^0 + h_\alpha$ for $\alpha = 1, \ldots, r$. By construction, for any $\alpha$ and $\beta$,

$$\|Pf_\alpha - Pf_\beta\|_K^2 \le \|P\|_M^2\, \|f_\alpha - f_\beta\|_H^2 \le \tfrac14 \delta^{-2}\, |M|\, \sigma(4\delta)^2 \cdot 64\delta^2 = 16\, |M|\, \sigma(4\delta)^2. \qquad (5.7)$$

Substituting back into (5.6), and performing some elementary algebra, we have $\sup_{\mathcal{F}} E_f\|\hat f - f\|^2 \ge \delta^2\,[1 - d_3 n \sigma^2(4\delta)]$, where $d_3$ is an appropriate constant. Now choose $\delta$ so that $d_3 n \sigma^2(4\delta) = \tfrac12$ and the proof of Proposition 5.1 is complete. □

The estimation problem we study can be thought of as estimation of $Qg$, where $g \in \mathcal{G}$ and $Q$ ($= P^{-1}$) is an unbounded operator. The term "modulus of continuity" might be more appropriately applied to a measure of the rate of growth of the singular values of $Q$ relative to $\mathcal{G}$. Indeed it is in this form that the similarity to the modulus of Donoho and Liu (1989) is clearer. Now suppose that $\mathcal{G}$ is a translate $g^0 + K_0$ of a set balanced about the origin in $K$. We denote finite dimensional balls about 0 in $K$ by $U$. Define the normalised radius $\rho(U)$ to be the radius of $U$ divided by the square root of the dimension of $U$.

Define a generalised modulus of continuity of $Q$ over the parameter space $K_0$ by

$$\tilde r(\varepsilon) = \sup_U \inf_{v \in \partial U} \|Qv\|_H, \qquad (5.8)$$

where the supremum is taken over the class of finite dimensional balls $U \subset K_0$ for which $\rho(U) = \varepsilon$. Notice that if $Q$ is a linear functional (so that $(H, \|\cdot\|_H) = (\mathbb{R}, |\cdot|)$), the above definition reduces to

$$\tilde r(\varepsilon) = \sup\{|Qv| : \|v\|_K = \varepsilon,\ tv \in K_0 \text{ for } |t| \le 1\},$$

which is the modulus of continuity studied by Donoho and Liu (1989).

It can be shown that $\tilde r$ is approximately inversely related to the modulus $\sigma$ defined at (5.1) in the sense that $\tilde r(\sigma(\delta)) \le \delta$. Thus $\sigma^{-1}(\varepsilon) \ge \tilde r(\varepsilon)$, and so the lower rate bounds derived from use of $\sigma$ are at least as good as those that would follow from $\tilde r$. It turns out that these rate bounds are in fact equivalent for all the applications discussed in this paper. These results and extensions will be discussed more fully elsewhere.

5.2 Completion of proof of Theorem 3.1

We now return to the PET setting to prove two propositions that complete the proof of Theorem 3.1. Both are proved by finding reasonable lower bounds to $r(\varepsilon)$.

Proposition 5.2 Subject to the conditions of Theorem 3.1, there exists a constant $d_D(p, C) > 0$ such that

$$r_D(n) \ge d_D\, (\log n/n)^{p/(p+1)}.$$

Proof Set $H = K = L^2(B, \mu)$ and $P = I$. Let $f^0$ be the uniform density and $H_0 = \mathcal{F}_{p,C} - f^0$. A good upper bound for $\sigma(\delta)$ as defined in (5.1) can be obtained by considering high dimensional subspaces $M$ subject to the constraint that $B_M(\delta) \subset H_0$. For large $\eta$, let $M_\eta = \mathrm{span}\{\varphi_\nu : a_\nu^2 \le \eta^p\}$. Then $B_{M_\eta}(\delta) \subset H_0$ when $\eta^p \le C^2/\delta^2$. From the definition of $\sigma(\delta)$, it follows that

$$\sigma^2(\delta) \le \delta^2/\sup\{|M_\eta| : \eta^p \le C^2/\delta^2\}.$$

Using $\chi$ to denote the characteristic function of a set, $|M_\eta| = \sum_{j,k} \chi\{(j+1)(k+1) \le \eta\} = \eta \log \eta\, (1 + o(1))$ by Lemma 4.3. Hence $\sigma^2(\delta) \le d_4\, \delta^{2(p+1)/p}/\log \delta^{-2}$, from which it follows that $r^2(\varepsilon) \ge d_5\, (\varepsilon^2 \log \varepsilon^{-2})^{p/(p+1)}$, so that $r^2(cn^{-1/2}) \ge d_6\, (\log n/n)^{p/(p+1)}$. Substitute back into Proposition 5.1 to complete the proof. □
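The dimension count $|M_\eta|$ used in this proof is easy to check numerically. The following is a minimal sketch only (the function name `lattice_count` is ours, not part of the paper): it counts the pairs $(j,k)$, $j,k \ge 0$, with $(j+1)(k+1) \le \eta$ and compares the count with $\eta \log \eta$. The ratio tends to 1 rather slowly, because the correction term is of order $\eta$.

```python
import math


def lattice_count(eta):
    """Number of pairs (j, k), j, k >= 0, with (j + 1) * (k + 1) <= eta."""
    # For each a = j + 1, the admissible values of b = k + 1 are 1, ..., floor(eta / a).
    return sum(int(eta // a) for a in range(1, int(eta) + 1))


if __name__ == "__main__":
    for eta in (10, 10**3, 10**5):
        count = lattice_count(eta)
        # Print eta, the exact count, and its ratio to eta * log(eta).
        print(eta, count, count / (eta * math.log(eta)))
```

The ratio printed in the last column gives a feel for how large $\eta$ must be before the leading term $\eta \log \eta$ dominates.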


Proposition 5.3 Subject to the conditions of Theorem 3.1, there exists a constant $d_I(p, C) > 0$ such that

$$r_I(n) \ge d_I\, (1/n)^{p/(p+2)}.$$

Proof Now take $H$ and $H_0$ as above, and let $K$ be the Hilbert subspace of $L^2(D, \lambda)$ generated by the orthonormal set of singular functions. This time a good bound for $\sigma(\delta)$ must use high dimensional subspaces (with $B_M(\delta) \subset H_0$) for which in addition $\|P\|_M$ is small. For given $\eta$, set $M_\eta = \mathrm{span}\{\varphi_{j0} : \tfrac12\eta \le j+1 \le \eta\}$. Then $\|P\|_{M_\eta}^2 = \max\{b_\nu^2 : \varphi_\nu \in M_\eta\} \le 2\eta^{-1}$, and $|M_\eta| \ge [\tfrac12\eta]$. As in the proof of Proposition 5.2, $B_{M_\eta}(\delta) \subset H_0$ if $\eta^p \le C^2/\delta^2$. Substituting into (5.1), we have, for sufficiently small $\delta$,

$$\sigma^2(\delta) \le \delta^2 \inf\{2\eta^{-1}/[\tfrac12\eta] : \eta^p \le C^2/\delta^2\} = d_7\, \delta^{2(p+2)/p}.$$

Consequently $r^2(\varepsilon) \ge d_8\, \varepsilon^{2p/(p+2)}$ and $r^2(cn^{-1/2}) \ge d_9\, n^{-p/(p+2)}$, which, as above, can be substituted into Proposition 5.1 to complete the proof. □
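The exponent $2(p+2)/p$ in the bound on $\sigma^2(\delta)$ can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' computation: it takes $C = 1$ by default, evaluates the bound at the largest admissible $\eta$ (where the expression inside the infimum is smallest), and estimates the log-log slope between two values of $\delta$. The names `sigma_sq_bound` and `loglog_slope` are ours.

```python
import math


def sigma_sq_bound(delta, p, C=1.0):
    """Evaluate delta^2 * (2 / eta) / floor(eta / 2) at the largest admissible eta."""
    eta = (C / delta) ** (2.0 / p)   # largest eta with eta^p <= C^2 / delta^2
    dim = math.floor(eta / 2.0)      # the lower bound [eta / 2] on |M_eta|
    if dim < 1:
        raise ValueError("delta is too large: the subspace M_eta is empty")
    return delta**2 * (2.0 / eta) / dim


def loglog_slope(p, d1=1e-3, d2=1e-4):
    """Slope of log(bound) against log(delta); should be close to 2 (p + 2) / p."""
    num = math.log(sigma_sq_bound(d1, p)) - math.log(sigma_sq_bound(d2, p))
    return num / (math.log(d1) - math.log(d2))


if __name__ == "__main__":
    for p in (1, 2, 4):
        print(p, loglog_slope(p), 2.0 * (p + 2) / p)
```

Because of the integer part in $[\tfrac12\eta]$, the slope agrees with $2(p+2)/p$ only approximately when $\eta$ is moderate.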

We close this section by remarking that Ibragimov and Hasminskii (1981) and Stone (1982) have shown that the minimax rate of convergence of global mean integrated square error for direct nonparametric density and regression problems is $n^{-2p/(2p+d)}$, where $p$ is the assumed amount of smoothness and $d$ is the dimension, $d = 2$ in our case. They consider classes of functions constrained by a Hölder continuity condition of order $\alpha \in [0, 1]$ on the $s$th derivative, so that $p = s + \alpha$. The extra $\log n$ term in the rate of convergence $(\log n/n)^{2p/(2p+d)}$ obtained in the present paper reflects the slightly reduced smoothness imposed by requiring only square-integrability of the $p$th weak derivative.

6.  Biased  sampling  and  attenuation 

In  any  practical  PET  scan,  not  all  pairs  of  emitted  photons  are  detected.  We  shall 
show  in  this  section  that  two  of  the  main  reasons  for  this  incompleteness  of  sampling 
can  be  placed  within  the  same  mathematical  framework,  and  that  our  results  can,  in 
part,  be  extended  to  account  for  them.  Under  mild  assumptions,  the  incompleteness  of 
sampling  has  no  effect  on  the  minimax  rate  of  consistency  found  in  Theorem  3.1. 


6.1 The effect of the third dimension

Up to now, we have considered the detectors as forming a circle in the plane, and we have assumed that all the paths of emitted photons fall in this plane. Of course, in reality the detectors form a ring of finite thickness $d > 0$, and the orientation of the line of flight of the photons is uniformly distributed in $\mathbb{R}^3$. We shall assume that the emission density is constant over the thickness of the cylindrical slab enclosed by the detector ring. Only emissions taking place in this slab will be considered, since only they have any chance of being detected at all.


Given any emission, the photon line-of-flight is now parametrised by three coordinates $(s, \varphi, \varphi')$, where $(s, \varphi)$ are the coordinates in detector space of the projection of the line onto the detector plane, and the vertical angle $\varphi'$ ($-\pi/2 < \varphi' < \pi/2$) is the angle between the line and its projection. The assumption that the line has uniformly distributed direction implies that, independently of $(s, \varphi)$, the vertical angle has probability density $\tfrac12 \cos\varphi'\, d\varphi'$. An emission line will only be detected if its vertical angle is such that both photons hit the detector ring. If the emission is detected, only the coordinates $s$ and $\varphi$ are observed.


Condition on a particular $s$ and $\varphi$, and let $l = 2(1-s^2)^{1/2}$ be the length of the corresponding detector tube. Assume that an emission takes place at distance $t$ from the centre of the tube and at vertical position $Z$ as shown in Figure 6.1. Assume that the projection of the line of flight of the emitted photons has coordinates $(s, \varphi)$. Let $(-\varphi_1, \varphi_2)$ be the range of vertical angles over which both photons will hit the detectors. For given $\varphi_1$ and $\varphi_2$ the probability of detection will be

$$\int_{-\varphi_1}^{\varphi_2} \tfrac12 \cos\varphi'\, d\varphi' = \tfrac12(\sin\varphi_1 + \sin\varphi_2).$$

We have (see Figure 6.2)

$$\tan\varphi_2 = \begin{cases} Z/(t + \tfrac12 l) & \text{if } Z \le (t + \tfrac12 l)\,d/l, \\ (d - Z)/(\tfrac12 l - t) & \text{otherwise.} \end{cases}$$

By assumption, $Z$ is uniformly distributed over $(0, d)$. By elementary calculus, the expected value of $\sin\varphi_2$ over this distribution of $Z$ is equal to

$$d^{-1} \int_0^{(t + l/2)d/l} \sin[\tan^{-1}\{z/(\tfrac12 l + t)\}]\, dz + d^{-1} \int_{(t + l/2)d/l}^{d} \sin[\tan^{-1}\{(d - z)/(\tfrac12 l - t)\}]\, dz$$

$$= d^{-1}(\tfrac12 l + t) \int_0^{d/l} \sin(\tan^{-1}u)\, du + d^{-1}(\tfrac12 l - t) \int_0^{d/l} \sin(\tan^{-1}u)\, du$$

$$= d^{-1}\, l\, \{(1 + d^2/l^2)^{1/2} - 1\}. \qquad (6.1)$$


By symmetry, the expected value of $\sin\varphi_1$, and hence the expected probability of detection conditional on $s$ and $\varphi$, will also be equal to the expression in (6.1). Note that this probability is independent of $t$ and only depends on the tube length $l$. Letting $a_{3D}(s, \varphi)$ be the probability that an emission in tube $(s, \varphi)$ is actually detected, it follows from (6.1) that

$$a_{3D}(s, \varphi) = \{4(1-s^2)d^{-2} + 1\}^{1/2} - 2(1-s^2)^{1/2} d^{-1}.$$

This quantity increases as $s$ increases, reflecting the fact that emissions in shorter tubes (large $s$) are more likely to be detected. We have, finally,

$$0 < (1 + 4d^{-2})^{1/2} - 2d^{-1} \le a_{3D}(s, \varphi) \le 1 \quad \text{for all } s \in [0, 1].$$
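The closed form for the detection probability, and its independence of the position $t$ within the tube, can be checked by direct numerical integration over the vertical position $Z$. The sketch below is only an illustration: the function names are ours, a midpoint rule stands in for the exact integral, and the formula for $\tan\varphi_1$ is obtained from that for $\tan\varphi_2$ by symmetry, which is an assumption consistent with the text rather than a formula quoted from it.

```python
import math


def detection_prob_closed_form(s, d):
    """Closed form {4(1 - s^2)/d^2 + 1}^(1/2) - 2(1 - s^2)^(1/2)/d."""
    half = math.sqrt(1.0 - s * s)  # half the tube length, l / 2
    return math.sqrt(4.0 * half**2 / d**2 + 1.0) - 2.0 * half / d


def detection_prob_numeric(s, d, t, n_steps=20000):
    """Average of (sin(phi1) + sin(phi2)) / 2 over Z uniform on (0, d).

    Requires |t| < l / 2, i.e. an emission strictly inside the tube.
    """
    half = math.sqrt(1.0 - s * s)
    total = 0.0
    for i in range(n_steps):
        z = (i + 0.5) * d / n_steps  # midpoint of the i-th cell in Z
        # Largest upward and downward tilts for which both photons hit the ring.
        tan2 = min(z / (half + t), (d - z) / (half - t))
        tan1 = min((d - z) / (half + t), z / (half - t))
        sin2 = tan2 / math.sqrt(1.0 + tan2 * tan2)
        sin1 = tan1 / math.sqrt(1.0 + tan1 * tan1)
        total += 0.5 * (sin1 + sin2)
    return total / n_steps


if __name__ == "__main__":
    s, d = 0.3, 0.5
    print(detection_prob_closed_form(s, d))
    for t in (0.0, 0.4, -0.7):
        print(t, detection_prob_numeric(s, d, t))
```

The numerical values for different $t$ should agree with each other and with the closed form up to quadrature error.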


6.2  Attenuation 

The other effect we shall consider is attenuation, defined as being the loss of a detection caused by the absorption or scattering of one of the photons in flight. Let us model the probability of such loss of a photon as it travels between $x$ and $x + dx$ as $\mu(x)|dx|$ and assume that $\mu(x)$ is bounded. Suppose an emission occurs at a point $x_0$ and that $\gamma$ is the line of flight of the emitted photons. Let $\gamma_+(x_0)$ and $\gamma_-(x_0)$ be the half-lines of $\gamma$ emanating from $x_0$, and assume $\gamma$ intersects the detector ring. By standard Poisson process theory, the probability that neither photon will be lost is given by

$$\exp\Big\{-\int_{\gamma_+(x_0)} \mu(x)\, dx\Big\} \exp\Big\{-\int_{\gamma_-(x_0)} \mu(x)\, dx\Big\} = \exp\Big\{-\int_{\gamma} \mu(x)\, dx\Big\} = a_A(s, \varphi), \text{ say.}$$

Just as in Section 6.1, the probability that the emission will be detected depends only on the detector tube $(s, \varphi)$ and is independent of the emission's position within that tube. In general, if both effects are considered, the probability that any particular detection will not be lost will be $a_{3D}(s, \varphi)\, a_A(s, \varphi)$. Both effects are important in PET; intensities reconstructed ignoring them can, in practice, be too low by a factor of three in the centre of the image (F. Natterer, personal communication). A common technique for correcting for attenuation is to estimate it separately, for example by a transmission scan.
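The fact that the survival probability depends only on the line, and not on where along it the emission occurs, can be illustrated numerically. In the sketch below the line is parametrised by arc length $u \in [0, L]$, the attenuation rate `mu_example` is a purely hypothetical bounded function chosen for illustration, and the integrals are approximated by a midpoint rule; none of these choices comes from the text.

```python
import math


def line_integral(mu, a, b, n_steps=2000):
    """Midpoint-rule approximation to the integral of mu over [a, b]."""
    h = (b - a) / n_steps
    return sum(mu(a + (i + 0.5) * h) for i in range(n_steps)) * h


def survival_probability(mu, length, u0):
    """Probability that neither photon is lost for an emission at arc length u0."""
    backward = math.exp(-line_integral(mu, 0.0, u0))    # photon heading towards u = 0
    forward = math.exp(-line_integral(mu, u0, length))  # photon heading towards u = length
    return backward * forward


def mu_example(u):
    """Hypothetical bounded attenuation rate along the line (illustration only)."""
    return 0.1 + 0.05 * math.sin(3.0 * u)


if __name__ == "__main__":
    for u0 in (0.2, 1.0, 1.7):
        print(u0, survival_probability(mu_example, 2.0, u0))
```

All three emission positions should give the same probability, namely the exponential of minus the integral of the attenuation rate over the whole line.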

6.3  A  general  framework  and  the  extension  of  our  results 

The two effects we have discussed can be combined by assuming the existence of a function $a(s, \varphi)$, $0 < a(s, \varphi) \le 1$, such that a positron emission at $(x_1, x_2)$ gives rise to a detection at $(S, \Phi)$ as defined in Section 2.1 with probability $a(S, \Phi)$ conditional on $(S, \Phi)$; with probability $1 - a(S, \Phi)$ the detection is lost. It follows from this formulation that the observed detections will form a biased sample with density in detector space with respect to $d\lambda(s, \varphi)$

$$g_a(s, \varphi) = P_a f(s, \varphi) = a(s, \varphi)\, Pf(s, \varphi) \Big/ \int_D Pf(s', \varphi')\, a(s', \varphi')\, d\lambda(s', \varphi').$$

Let $\mathcal{T}_B(n)$ be the class of all estimators of $f$ based on a sample of size $n$ from $P_a f$, and let $r_B(n)$ be the minimax mean integrated square error over $\hat f$ in $\mathcal{T}_B(n)$ and $f$ in $\mathcal{F}_{p,C}$.


Theorem 6.1. Suppose that $\inf_D a(s, \varphi) = a_0 > 0$ and make the assumptions of Theorem 3.1. Then

$$r_B(n) \asymp n^{-p/(p+2)}. \qquad (6.2)$$

Proof. The order of magnitude in (6.2) is of course the same as that obtained for unbiased indirect estimation in (3.5).

Suppose, first, that $f$ is the least favourable density in $\mathcal{F}_{p,C}$ for estimation by estimators in $\mathcal{T}_I(n)$. Let $n' = [\tfrac12 a_0 n]$. Suppose $Y_1, Y_2, \ldots$ is an i.i.d. sequence drawn from $Pf$. Construct an i.i.d. sequence $Z_1, Z_2, \ldots$ from $P_a f$ by including each $Y_i$ in the sequence with probability $a(Y_i) \ge a_0$. Let $\hat f$ be the estimator of $f$ based on $Z_1, \ldots, Z_{n'}$ using the minimax estimator in $\mathcal{T}_B(n')$, so that $M(\hat f; f) \le r_B(n')$. Now let $N$ be the number of $Y_1, \ldots, Y_n$ that are included in the $Z_j$ sequence, and let $\hat f_1$ be equal to $\hat f$ if $N \ge n'$ and 1 otherwise. Since $\hat f_1$ is based on $Y_1, \ldots, Y_n$, and $f$ is least favourable for $\mathcal{T}_I(n)$, we have $M(\hat f_1; f) \ge r_I(n)$. By an elementary argument, $M(\hat f_1; f) \le M(\hat f; f) + P(N < n') \int (f - 1)^2\, d\mu$, so that $r_B(n') \ge r_I(n) - P(N < n')$, making use of Proposition 2.3 and the assumption $C < 2^{(p-1)/2}$ to bound $\int (f - 1)^2$ by 1. A crude bound now suffices for $P(N < n')$; since $N$ is stochastically larger than a $\mathrm{Bi}(n, a_0)$ random variable, $P(N < n') \le P\{\mathrm{Bi}(n, a_0) < \tfrac12 n a_0\} = O(n^{-1})$ by Chebyshev's inequality. We conclude that $r_B([\tfrac12 n a_0]) \ge r_I(n) - O(n^{-1})$.

Now reverse the role of biased and unbiased samples throughout the argument. If $Z_1, Z_2, \ldots$ is an i.i.d. sample from $P_a f$, then a sample $Y_1, Y_2, \ldots$ from $Pf$ can be constructed by including each $Z_i$ with probability $a_0/a(Z_i)$; this quantity necessarily lies between $a_0$ and 1. The analogous argument to that used above yields that $r_I([\tfrac12 n a_0]) \ge r_B(n) - O(n^{-1})$. Applying Theorem 3.1 it now follows that $r_B(n)$ has the same order of magnitude $n^{-p/(p+2)}$ as $r_I(n)$. □
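The two thinning constructions used in this proof are simple to simulate. The sketch below is a toy illustration under our own assumptions: detector space is replaced by three hypothetical cells, `pf` plays the role of $Pf$, and `a_vals` plays the role of $a(s, \varphi)$. Thinning draws from $Pf$ with probability $a$ should give frequencies proportional to $a \cdot Pf$, and thinning those again with probability $a_0/a$ should recover $Pf$.

```python
import random
from collections import Counter


def thin_to_biased(ys, a, rng):
    """Keep each draw y from Pf with probability a(y); survivors follow P_a f."""
    return [y for y in ys if rng.random() < a(y)]


def thin_to_unbiased(zs, a, a0, rng):
    """Keep each draw z from P_a f with probability a0 / a(z); survivors follow Pf."""
    return [z for z in zs if rng.random() < a0 / a(z)]


def frequencies(sample, cells):
    """Empirical frequency of each cell in the sample."""
    counts = Counter(sample)
    return [counts[c] / len(sample) for c in cells]


if __name__ == "__main__":
    rng = random.Random(0)
    cells = [0, 1, 2]                      # hypothetical detector cells
    pf = [0.5, 0.3, 0.2]                   # hypothetical unbiased cell probabilities
    a_vals = {0: 0.4, 1: 0.8, 2: 1.0}      # hypothetical detection probabilities
    a0 = min(a_vals.values())
    ys = rng.choices(cells, weights=pf, k=200_000)
    zs = thin_to_biased(ys, a_vals.get, rng)
    back = thin_to_unbiased(zs, a_vals.get, a0, rng)
    print(frequencies(zs, cells))          # close to a * pf, renormalised
    print(frequencies(back, cells))        # close to pf
```

In this toy example the renormalised biased probabilities are $(0.3125, 0.375, 0.3125)$, and the expected fraction retained at the second step is $a_0/0.64$.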

There is of course a distinction between a biased sample of $n$ observations drawn from $P_a f$ and a censored sample consisting of all the observations that are detected arising from $n$ emissions in brain space. The censored sample will consist, in the notation of the proof of Theorem 6.1, of $N$ observations from $P_a f$. Implicit in the proof of Theorem 6.1 is a demonstration that the minimax mean integrated square error for estimation based on this censored sample will have the same order of magnitude as $r_I(n)$ under the assumption $a_0 > 0$.

For the third dimension effect, as the detector ring thickness $d \to 0$, we have $a_{3D}(s, \varphi) \sim \tfrac14 d\,(1-s^2)^{-1/2}$ and $a_0 \to 0$. In the limiting case, the biased sample density will be proportional to $(1-s^2)^{-1/2} Pf(s, \varphi)$, whose ratio to $Pf(s, \varphi)$ is unbounded as $s \to 1$. Theorem 6.1 no longer applies, but it can be shown that the biased sampling has at most a logarithmic order effect, in that the order of magnitude of $r_B(n)$ lies between $(n \log n)^{-p/(p+2)}$ and $n^{-p/(p+2)}$. This is a consequence of the following more general result on singular biased sampling, whose proof is omitted.


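Theorem 6.2: Suppose $p \ge 1$ and $0 < C < 2^{(p-1)/2}$.

(a) If $\int_D a(s, \varphi)^{-1} (1-s^2)^{-1}\, d\lambda(s, \varphi) < \infty$, then there exists $c_1$ such that $r_B(n) \le c_1 n^{-p/(p+2)}$.

(b1) If $\int_D a(s, \varphi)\, (1-s^2)^{-1}\, d\lambda(s, \varphi) < \infty$, then there exists $c_2$ such that $r_B(n) \ge c_2 n^{-p/(p+2)}$.

(b2) If $\int_D a(s, \varphi)\, d\lambda(s, \varphi) < \infty$, and $\sup_s (1-s^2)^{1/2} \int a(s, \varphi)\, d\varphi < \infty$, then there exists $c_3$ such that $r_B(n) \ge c_3 (n \log n)^{-p/(p+2)}$.

For $a(s, \varphi) = (1-s^2)^{-1/2}$, the conditions of (a) and (b2) hold but the integral in (b1) is infinite.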


7.  Alternative  error  measures 

Our  results  can  be  extended  to  some  more  general  measures  of  the  discrepancy 
between  the  estimator  and  the  unknown  function  than  mean  integrated  square  error. 
We  can  treat  a  class  of  losses  that  takes  into  account  the  closeness  of  derivatives,  as 
well  as  values,  of  the  estimate  to  those  of  the  true  unknown  function;  these  losses  take 
more  account  of  the  "shape"  of  the  function  than  does  ordinary  mean  integrated  square 
error. 

Define measures $\mu_p$ as in Proposition 2.2. It is noted in the Appendix that, for integers $q > 0$, the squared norm

$$\int |f|^2\, d\mu_1 + \sum_{r+s=q} \int |\partial^{r+s} f/\partial x_1^r \partial x_2^s|^2\, d\mu_{q+1} \qquad (7.1)$$

is equivalent to

$$\|f\|_q^2 = \sum_{j,k \ge 0} (j+1)^q (k+1)^q f_{jk}^2. \qquad (7.2)$$

For non-integer values of $q$, the norm $\|\cdot\|_q$ will be a more general Sobolev norm (Adams, 1975), although some care will be necessary because of the nonstandard dominating measures; this is a topic for future investigation.

We can now state and prove a theorem that gives the exact minimax rates in the $\|\cdot\|_q$ norm for both direct and indirect estimation. Theorem 3.1 is the special case $q = 0$ and it can be seen that the rates available are both reduced, in a natural way, when higher order norms are used.

Theorem 7.1: For fixed $p \ge 1$, $0 < C < 2^{(p-1)/2}$, and $0 \le q < p$,

$$\inf_{\hat f \in \mathcal{T}_D(n)} \sup_{f \in \mathcal{F}_{p,C}} E\|\hat f - f\|_q^2 \asymp (\log n/n)^{(p-q)/(p+1)}$$

and

$$\inf_{\hat f \in \mathcal{T}_I(n)} \sup_{f \in \mathcal{F}_{p,C}} E\|\hat f - f\|_q^2 \asymp (1/n)^{(p-q)/(p+2)}.$$

Proof The proof is analogous to that of Theorem 3.1 and we shall confine ourselves to a brief outline of the necessary changes. Define $c_\nu^2 = c_{jk}^2 = (j+1)^q (k+1)^q$ and $\Gamma = \mathrm{diag}(c_\nu)$. To obtain upper bounds, define the surrogate risk $M_q^*(\hat f; f) = \sum_\nu c_\nu^2 \{\mathrm{var}\, \hat f_\nu + (E \hat f_\nu - f_\nu)^2\}$. The result corresponding to Proposition 2.1 is immediate. As in Lemma 4.1, $M_q^*(\hat f; f) = n^{-1}\, \mathrm{tr}\, \Gamma W (I - e_0 e_0^T) W^T \Gamma + \|\Gamma (I - WB) f\|^2$. As in Lemma 4.2, the minimax surrogate risk for linear estimates over the ellipsoid $\mathcal{F}_{p,C}$ is given by $n^{-1} \sum_\nu (c_\nu^2 b_\nu^{-2} - c_\nu b_\nu^{-2} a_\nu \gamma^{1/2})_+$, where $\gamma$ is chosen to satisfy $n^{-1} \sum_\nu (a_\nu c_\nu b_\nu^{-2} \gamma^{-1/2} - a_\nu^2 b_\nu^{-2})_+ = C^2$. So long as $q < p$, we obtain surrogate linear minimax rates of convergence equal to $(n/\log n)^{-(p-q)/(p+1)}$ and $n^{-(p-q)/(p+2)}$ in the direct and indirect cases respectively. Clearly it would be possible to obtain more precise results corresponding to Theorem 3.2, but we shall refrain from doing so.

The methods of Section 5 show that these are in fact the exact minimax rates of convergence for the $\|\cdot\|_q$ norm for general estimators. From Proposition 5.1, it is only necessary to compute the modulus $\sigma(\delta)$ of (5.1), now with respect to the $\|\cdot\|_q$ norm on $H$. This calculation goes as in Propositions 5.2 and 5.3, even using the same definitions of the subspaces $M_\eta$. Since the $\|\cdot\|_q$ norm is now used on $H$, $\|P\|_{M_\eta}^2 = \sup\{b_\nu^2 c_\nu^{-2} : \varphi_\nu \in M_\eta\}$ and $B_{M_\eta}(\delta) \subset H_0$ if $\eta^{p-q} \le C^2/\delta^2$. With these changes, the proof is completed. □


8.  Some  concluding  remarks 

This  paper  has  focused  on  lower  and  upper  bounds  for  one  particular  bivariate 
density  estimation  problem  for  indirect  data.  The  same  formalism  applies  to  many 
other  density  and  regression  estimation  problems.  The  celebrated  "unfolding"  problem 
for  sphere  size  distributions  is  an  example  involving  univariate  density  estimation  from 
indirect  data  and  the  singular  value  decomposition  of  the  Abel  transform.  For  recent 
results  and  further  references  on  this  problem,  see,  for  example,  Hall  and  Smith 
(1988),  Nychka  and  Cox  (1984),  Silverman  et  al.  (1988)  and  Wilson  (1988). 

Noisy integral equations of the form $y_i = (Pf)(t_i) + \varepsilon_i$ can be treated using our methods, at least under appropriate assumptions on the distributions of $(t_i, \varepsilon_i)$. For example, if the observation points $t_i$ follow a known distribution $\lambda(dt)$ and the errors $\varepsilon_i$ are independently Gaussian $(0, \sigma^2)$, then the information divergence between the hypotheses $f_1$ and $f_2$ is $K(P_{f_1}^{(n)}, P_{f_2}^{(n)}) = (n/2\sigma^2) \int \{Pf_1(t) - Pf_2(t)\}^2\, \lambda(dt)$, so that the lower bound methods of Section 5 immediately apply. Upper bound results are given, for example, by Nychka and Cox (1984).

For a generic one-dimensional problem with singular value decomposition $P\varphi_\nu = b_\nu \varphi_\nu$, $b_\nu \sim \nu^{-\beta}$, and with ellipsoid determined by $a_\nu^2 = \nu^{2\alpha}$, corresponding to "$\alpha$ derivatives", the exact minimax rate of convergence of the mean square error is $n^{-2\alpha/(2\alpha+2\beta+1)}$. This should be compared with the exact rate of $n^{-2\alpha/(2\alpha+1)}$ for the corresponding "direct" case. Related calculations for a large class of one-dimensional convolution equation models appear in Wahba and Wang (1987).
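The exponent in the one-dimensional rate can be recovered heuristically from the modulus (5.1), following the pattern of the proof of Proposition 5.3. The sketch below is ours: it ignores constants, and the choice of subspace $M_\eta = \mathrm{span}\{\varphi_\nu : \tfrac12\eta \le \nu \le \eta\}$ is an assumption made by analogy with that proof, not something stated in the text.

```latex
\begin{align*}
\|P\|_{M_\eta}^2 &\asymp \eta^{-2\beta}, \qquad |M_\eta| \asymp \eta, \qquad
  B_{M_\eta}(\delta) \subset H_0 \ \text{ if } \ \eta^{2\alpha} \le C^2/\delta^2, \\
\sigma^2(\delta) &\lesssim \delta^2\, \eta^{-2\beta} / \eta
  \quad \text{with } \eta \asymp \delta^{-1/\alpha}
  \quad \Longrightarrow \quad
  \sigma^2(\delta) \lesssim \delta^{(2\alpha + 2\beta + 1)/\alpha}, \\
r^2(\varepsilon) &\gtrsim \varepsilon^{4\alpha/(2\alpha + 2\beta + 1)}, \qquad
  r^2(c\, n^{-1/2}) \gtrsim n^{-2\alpha/(2\alpha + 2\beta + 1)}.
\end{align*}
```

Setting $\beta = 0$ (no smoothing by $P$) gives the exponent $2\alpha/(2\alpha+1)$ quoted for the direct case.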

One important topic for future attention is the effect of the discretisation of detector space due to the finite size of the detectors. It is clear intuitively that if the number of detectors is sufficiently large relative to the size of the sample collected, then the minimax rates will not be affected, and of course it would be interesting to
quantify  this  notion  more  precisely.  Some  PET  machines  (see,  for  example,  Snyder 
and  Politte,  1983)  are  able  to  use  time-of-flight  information  to  provide  an  approximate 
indication  of  the  place  in  the  detector  tube  where  an  emission  occurs.  This  is  usually 
accompanied  by  a  loss  in  detector  efficiency  and  hence  a  smaller  sample  size  n.  It 
would  be  desirable  to  extend  our  framework  to  make  a  quantitative  evaluation  of  this 
trade-off.  Kaufman,  Morgenthaler  and  Vardi  (1983)  report  some  earlier  work  on  this 
issue. 

Another issue that could be explored is the further extension of our results to deal with more general metrics on images. Finally there is very little known about the
theoretical  performance  of  algorithms  commonly  used  in  practice;  our  results  at  least 
provide  a  framework  and  a  benchmark  against  which  particular  algorithms  can  be 
judged. 

APPENDIX 

Proof of Proposition 2.1: The proof is elementary. Consider the direct case first. Suppose $Z$ is a random variable drawn from the uniform density and that $X$ is a random variable with density $f$. Then $\mathrm{var}_f \hat f(x)/\mathrm{var}_1 \hat f(x) = n^{-1}\, \mathrm{var}\, w(x, X)/n^{-1}\, \mathrm{var}\, w(x, Z)$. Now

$$\mathrm{var}\, w(x, X) \le E\{w(x, X) - E w(x, Z)\}^2 = \int \{w(x, q) - E w(x, Z)\}^2 f(q)\, d\mu(q)$$

$$\le \sup_B f \int \{w(x, q) - E w(x, Z)\}^2\, d\mu(q) = \sup_B f \cdot \mathrm{var}\, w(x, Z),$$

and similarly $\mathrm{var}\, w(x, Z) \le \sup_B (1/f)\, \mathrm{var}\, w(x, X) = \mathrm{var}\, w(x, X)/\inf_B f$. Thus $\mathrm{var}_f \hat f/\mathrm{var}_1 \hat f$, and hence $M/M^*$, is bounded between $\inf f$ and $\sup f$. In the indirect case an exactly analogous argument bounds $M/M^*$ between $\inf_D Pf$ and $\sup_D Pf$. It follows from (2.1) that $\inf_D Pf \ge \inf_B f$ and that $\sup_D Pf \le \sup_B f$, completing the proof.

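Proof of Proposition 2.2: We employ Gegenbauer (ultraspherical) polynomials $C_n^\alpha$ as normalised in Gradshteyn and Ryzhik (1980, p.827). An orthogonal basis for $L^2(B, \mu_\alpha)$ is given by the polynomials

$$\varphi_{jk}^\alpha(x) = (2\pi)^{-1} \int_0^{2\pi} e^{i(j-k)\theta}\, C_{j+k}^\alpha(u_\theta^T x)\, d\theta, \qquad j, k \ge 0, \quad u_\theta = (\cos\theta, \sin\theta)^T.$$

Defining the operator $(P_\alpha f)(s, \theta) = E\{f(X) \mid u_\theta^T X = s\}$ where $X \sim \mu_\alpha$, the polynomials $\varphi_{jk}^\alpha$ are the pullback by the adjoint $P_\alpha^*$ of the singular functions $e^{i(j-k)\theta} C_{j+k}^\alpha(s)$ of $P_\alpha P_\alpha^*$. This construction of the SVD of $P_\alpha$ is explained in Johnstone (1988), following Davison (1981, 1983) but with different notation and normalisations. It can be shown that $\varphi_{jk}^1 = (j+k+1)^{-1/2} \varphi_{jk}$, so that $f = \sum (j+k+1)^{1/2} f_{jk}\, \varphi_{jk}^1$.

Let $D_z = \tfrac12(\partial/\partial x_1 - i\, \partial/\partial x_2)$ and $D_{\bar z} = \tfrac12(\partial/\partial x_1 + i\, \partial/\partial x_2)$. Since $(d/dt)\, C_n^\alpha = 2\alpha\, C_{n-1}^{\alpha+1}$, we have, setting $\varphi_{jk} = 0$ if $j$ or $k < 0$, $D_z \varphi_{jk}^\alpha = \alpha\, \varphi_{j-1,k}^{\alpha+1}$ and $D_{\bar z} \varphi_{jk}^\alpha = \alpha\, \varphi_{j,k-1}^{\alpha+1}$. The raising of the index from $\alpha$ to $\alpha+1$ leads us from the original measure $\mu$ of Section 2 to the family $\mu_{p+1}$, so that, for example, the family of derivatives $D_z \varphi_\nu$ and $D_{\bar z} \varphi_\nu$ is orthogonal with respect to $\mu_2$, not $\mu_1$.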

It  is  shown  in  Johnstone  (1989)  that  if  r+s=p,  then,  for  certain  constants  cjkrs  all 
falling  in  [(p+l)-2p+I  .(p+Dp2?], 

j(DrzD§f)2dnp+ ,  =  p!£  (j+k+ 1  )f2k J ( <pf*r\k_s)2 <Uip+  l 

=  ICjUj+mk+iyip+Df2  . 
jzr 
kis 

Standard  arguments  of  analysis  complete  the  proof. 

Proof of Proposition 2.3: In the complex form of the basis, we have

$$f(r, \theta) = \sum_{(j,k) \in N'} f_{jk}\, (j+k+1)^{1/2}\, e^{i(j-k)\theta}\, Z_{j+k}^{|j-k|}(r).$$

Zernike polynomials satisfy $\sup_{0 \le r \le 1} |Z_n^m(r)| = Z_n^m(1) = 1$, as a consequence of the representation in terms of Jacobi polynomials: $Z_{l+2s}^{l}(r) = r^l P_s^{(0,l)}(2r^2 - 1)$, together with the results of Szegő (1939, p.163), applied to the polynomials $Q_s(t) = P_s^{(0,l)}(2t - 1)$ as $s$ varies. Hence

$$\sup |f - 1| \le \sum_{N' \setminus (0,0)} (j+k+1)^{1/2}\, |f_{jk}|.$$

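The ellipsoid $\mathcal{F}_{p,C}$ has exactly the same description in terms of either the real or the complex form of the basis. Setting $x_{jk} = (j+1)^{p/2} (k+1)^{p/2} |f_{jk}|/C$, we obtain

$$\sup_{\mathcal{F}_{p,C}} \sup |f - 1| \le C \sup\Big\{\sum_{N' \setminus (0,0)} (j+k+1)^{1/2} (j+1)^{-p/2} (k+1)^{-p/2} x_{jk} : \sum x_{jk}^2 \le 1\Big\}$$

$$\le C \sup_{N' \setminus (0,0)} (j+k+1)^{1/2} (j+1)^{-p/2} (k+1)^{-p/2} = C\, 2^{(1-p)/2},$$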

provided $p \ge 1$. To complete the proof, note that the linear function $1 + 2^{(1-p)/2} C r \cos\theta$ falls in $\mathcal{F}_{p,C}$ and satisfies $\sup |f - 1| = C\, 2^{(1-p)/2}$.

TABLE 1

Constants needed for Theorem 3.2. Euler's constant $\gamma_E = 0.57722\ldots$.


C\  =  exp  i-cjc-,) 

c5  =  (p+l)l/0’+1)c?/0’+1> 

c2  =  c7_1(/7+l)exp{(/7+l)c8/c7} 

c6  =  ±{z2p/3(p+4)  }pI(p+V(p+2)2/(>p+V 

=  p(p+2)-1 

c7  =  p(p+ir1(p+2)~1 

c4  =  2y£-(p+4)/(p+2) 

c8  =  2r£c7-{4(p+2)-2-(p+ir2J 

TABLE 2

Equivalent direct sample sizes m*(n) to achieve the same surrogate linear
minimax risk over smoothness class p as for an indirect sample of size n.

         n = 10^7       n = 10^8       ratio m(10^8)/m(10^7)
p = 1    1.93 x 10^5    1.03 x 10^6    5.34
p = 2    4.85 x 10^5    3.12 x 10^6    6.44
p = 5    1.29 x 10^6    1.05 x 10^7    8.09

Figure  Captions 

Fig.  2.1.  The  patient  and  the  detector  circle 
Fig.  2.2.  Parametrising  the  line  / 

Fig.  2.3.  Transforming  the  coordinates 
Fig. 6.1. The two cases for $\varphi_2$

SUMMARY

There  are  many  practical  problems  where  the  observed  data  are  not  drawn  directly 
from  the  density  g  of  real  interest,  but  rather  from  another  distribution  derived  from  g 
by  the  application  of  an  integral  operator.  The  estimation  of  g  then  entails  both 
statistical  and  numerical  difficulties.  A  natural  statistical  approach  is  by  maximum 
likelihood,  conveniently  implemented  using  the  EM  algorithm,  but  this  provides 
unsatisfactory  reconstructions  of  g.  In  this  paper,  we  modify  the  maximum 
likelihood  /  EM  approach  by  introducing  a  simple  smoothing  step  at  each  EM 
iteration.  In  our  experience,  this  algorithm  converges  in  relatively  few  iterations  to 
good  estimates  of  g  that  do  not  depend  on  the  choice  of  starting  configuration.  Some 
theoretical  background  is  given  that  heuristically  relates  this  smoothed  EM  algorithm 
to  a  maximum  penalised  likelihood  approach.  Two  applications  are  considered  in 
detail.  The  first  is  the  classical  stereology  problem  of  determining  particle  size 
distributions from data collected on a plane section through a composite medium. The
second  concerns  the  recovery  of  the  structure  of  a  section  of  the  human  body  from 
external  observations  obtained  by  positron  emission  tomography;  for  this  problem,  we 
also  suggest  several  technical  improvements  on  existing  methodology,  in  particular,  a 
new  pixellation  of  the  circular  image. 

Keywords:  ILL  POSED  PROBLEMS;  INDIRECT  OBSERVATIONS;  INTENSITY 
ESTIMATION;  MAXIMUM  LIKELIHOOD;  PENALISED 
LIKELIHOOD;  POSITRON  EMISSION  TOMOGRAPHY;  SMOOTHING; 
STEREOLOGY.

1. INTRODUCTION

A wide variety of practical problems, in fields ranging from medicine to remote sensing, involve indirect observations. Suppose that events occur on a space $Y$ according to a nonhomogeneous Poisson process of rate $g(y)$. These events cannot be observed directly; instead, an event at a point $y$ gives rise to an observable datum at a point $x$ in some space $X$. The observed datapoints fall as a nonhomogeneous Poisson process on $X$ with intensity $f(x)$, where $f$ and $g$ are related by the integral equation

$$f(x) = \int K(x, y)\, g(y)\, dy. \qquad (1.1)$$

Here, $K(x, y)$ is a non-negative kernel function which is assumed known. In some contexts $X$ and $Y$ are the same space, but we shall see that this is by no means always the case.
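To make the forward relation (1.1) concrete, the following sketch discretises it by a midpoint rule on $[0, 1]$. Everything specific here is an assumption chosen for illustration and does not come from the paper: the kernel `uniform_below` (an event at $y$ is observed uniformly on $[0, y]$), the intensity $g(y) = 2y$, and the function names. For this pair the exact answer is $f(x) = 2(1 - x)$, which gives a simple check on the discretisation.

```python
def forward_intensity(kernel, g, m):
    """Midpoint-rule approximation to f(x) = int_0^1 K(x, y) g(y) dy on an m-cell grid."""
    h = 1.0 / m
    grid = [(i + 0.5) * h for i in range(m)]
    # For each observation point x, sum K(x, y) g(y) over the y-grid and scale by h.
    f = [sum(kernel(x, y) * g(y) for y in grid) * h for x in grid]
    return grid, f


def uniform_below(x, y):
    """Hypothetical kernel: an event at y yields a datum uniform on [0, y]."""
    return 1.0 / y if x <= y else 0.0


if __name__ == "__main__":
    grid, f = forward_intensity(uniform_below, lambda y: 2.0 * y, 200)
    worst = max(abs(fi - 2.0 * (1.0 - x)) for x, fi in zip(grid, f))
    print("largest discretisation error:", worst)
```

The estimation problem discussed in the text runs in the opposite direction: $g$ is to be recovered from Poisson data with intensity $f$, which is much harder than evaluating the forward map.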

In this paper, we shall introduce and discuss a simple general approach to the estimation of the non-negative function $g$ from such indirect data. The general problem, and our solution to it, will be discussed in two particular contexts. The first, involving univariate functions, is the classical stereology problem of determining particle-size distributions from data collected on plane sections through a composite medium. The second is an interesting image processing problem, the recovery of the structure of a section of a radioactive emission intensity in the human body from external observations obtained by positron emission tomography (PET). Our intention is to contribute concretely to these problems and also methodologically more generally.

Equations of the form (1.1) are called first kind integral equations and have been
the subject of much study by numerical analysts, mainly from the point of view that
the function f is observed accurately. Most of this work does not take account of
constraints on g. Another, more statistical, problem that has been studied is the case
where values of f itself are observed subject to random error; see, for example,
Titterington (1985a) and O'Sullivan (1986). The relationship of this statistical
problem to the one we shall discuss is precisely that between regression and density
estimation for directly observed data. The problems have some similarities but
sufficient differences to make distinct approaches appropriate.

Our work builds on the paper of Vardi, Shepp and Kaufman (1985), who give a
good introduction to the statistical aspects of the PET problem and provide a method
for its solution based on the EM algorithm (Dempster, Laird and Rubin, 1977; Little
and Rubin, 1987). In general, a natural statistical approach to the estimation of g is
by maximum likelihood (ML) and it is to the solution of the ML problem that the EM
algorithm is addressed. However, as we shall see, ML reconstructions of g are
unsatisfactory. As in nonparametric curve and surface estimation generally, ML yields
"noisy" or "spiky" estimates that do not fully reflect knowledge about the structure of
the problem under consideration, and some kind of smoothing is necessary to provide
good estimates of g; see, for example, Silverman (1985a, 1986) and Besag (1986). This
is because of the high or infinite dimensionality of the parameter space. In the
problem we are considering, the ill-posed nature of the inversion of the integral
equation (1.1) exacerbates this difficulty. The classical mathematical work on integral
equations (e.g. Tikhonov and Arsenin, 1977) makes it clear that smoothing would be
necessary in numerical reality even if f were observed to arbitrary accuracy, for
example, from an infinite number of observations.

The  EM  algorithm  is  an  iterative  approach  that  increases  the  likelihood  of  the 
estimate of g at each iteration. Each stage of the algorithm consists of an E (for
expectation) step and an M (for maximisation) step. Our proposal is to introduce a third S
(for  smoothing)  step  at  each  iteration  where  the  current  estimate  of  g  is  smoothed  in  a 
suitable  way.  The  EMS  algorithm  maintains  various  advantages  of  the  EM  algorithm 
but  appears  to  eliminate  some  of  its  disadvantages.  Using  very  simple  smoothing 
schemes,  we  have  found  that  the  algorithm  converges,  in  a  relatively  small  number  of 
iterations,  to  good  estimates  of  g.  For  the  problems  we  shall  discuss,  a  little  smoothing 
goes  a  long  way. 

The general structure of such smoothed EM algorithms for indirect observation
problems is set out in Section 2. The stereology example is discussed in detail in
Section 3 and the PET example in Section 4. Our treatment of the PET example
incorporates some other algorithmic improvements over those of Vardi et al. (1985) and
others. Some theoretical background is given in Section 5 that heuristically relates the
EMS algorithm to a maximum penalised likelihood approach in which the penalty term
is quadratic in the square roots of the intensities.

2.  THE  EM  ALGORITHM  AND  SMOOTHING 

2.1  Notation  and  Preliminaries 

For practical reasons, the data with which we are concerned arise in histogram
form, so the data space X will be divided into bins. Index the data bins by t = 1,...,T
and denote the observed counts in these bins by n(t), t = 1,...,T. To facilitate
reconstruction, we also introduce a discretisation of the space Y into bins; let
s = 1,...,S index these "reconstruction" bins. Note that the discretisation of the
data-gathering aspect of both our applications is an irremovable physical constraint,
while the reconstruction discretisation arises from algorithmic considerations. We shall
seek to reconstruct the discretised version of g in (1.1), denoting expected total
occurrences in bin s by $\lambda(s)$. The discretised version of the kernel function will
be denoted by p(s,t), s = 1,...,S, t = 1,...,T; assuming that g is constant within each
bin, p(s,t) is the integral of K(x,y) over x in bin t and y in bin s, divided by the
size of bin s. Write $q(s) = \sum_t p(s,t)$, s = 1,...,S. Neglecting the variation of K
over bin s, we have the appealing interpretation

$$p(s,t)/q(s) = \mathrm{Prob}(\text{datum counted in bin } t \mid \text{datum is observed, having arisen from an event in bin } s).$$

Define k(s,t) to be the number of events occurring in bin s which contribute to
the count in bin t. It is immediately clear that all the k(s,t)'s are independent with,
for each s and t,

$$k(s,t) \sim \mathrm{Poisson}\{\lambda(s)\, p(s,t)\}.$$

The observed data arise simply from these as $n(t) = \sum_s k(s,t)$ so that, for
t = 1,...,T,

$$n(t) \sim \mathrm{Poisson}\Big\{\sum_s \lambda(s)\, p(s,t)\Big\}, \qquad (2.1)$$

independently for each t. On the other hand, an important set of unobservables is the
s-bin counts, $m(s) = \sum_t k(s,t)$, s = 1,...,S. All these m's are also mutually
independent and distributed as

$$m(s) \sim \mathrm{Poisson}\{\lambda(s)\, q(s)\}.$$

Define $m = (m(1),\ldots,m(S))^T$, $n = (n(1),\ldots,n(T))^T$,
$\lambda = (\lambda(1),\ldots,\lambda(S))^T$ and k to be the (S×T) matrix with (s,t)'th
element k(s,t).
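The discretised model can be made concrete by simulating it. The following is a minimal sketch only: the function names `poisson` and `simulate_counts`, the list-of-lists layout `p[s][t]` and the use of Knuth's simple Poisson sampler are all assumptions made for illustration, not part of the method described here.

```python
import math
import random


def poisson(mean, rng):
    """Draw one Poisson variate by Knuth's multiplication method.

    Adequate for the small bin means of a sketch; exp(-mean) underflows
    for very large means, where another sampler would be needed.
    """
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_counts(lam, p, rng):
    """Simulate k(s,t) ~ Poisson(lam[s] * p[s][t]) and derive n(t) and m(s)."""
    S, T = len(p), len(p[0])
    k = [[poisson(lam[s] * p[s][t], rng) for t in range(T)] for s in range(S)]
    # Observed data-bin counts: column sums of k.
    n = [sum(k[s][t] for s in range(S)) for t in range(T)]
    # Unobservable reconstruction-bin counts: row sums of k.
    m = [sum(k[s]) for s in range(S)]
    return k, n, m
```

Only `n` would be available in practice; `k` and `m` are returned here so that the relationship between complete and incomplete data can be inspected.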

2.2  The EM Algorithm

Consider the estimation of $\lambda$ by maximising the log likelihood, $l(n \mid \lambda)$,
based on the data n. These data can be regarded as an incomplete version of the complete
data, k, which we would like to have been able to observe. Dempster et al.'s (1977) EM
algorithm, applied to the PET version of the present context by Vardi et al. (1985),
then gives a two step iteration for increasing the likelihood of a current estimate
$\lambda^{(i)}$ of $\lambda$. In the E step for the current problem, we find the expected
value of the complete data, given the incomplete data, under the current estimate of
parameter values; in the M step, we find the ML estimate of the parameters using the
estimated complete data from the E step. From Vardi et al. (1985), this gives:

E STEP

compute $\hat k(s,t) = n(t)\, \dfrac{\lambda^{(i)}(s)\, p(s,t)}{\sum_r \lambda^{(i)}(r)\, p(r,t)}$ for each s and t,

M STEP

set $\lambda^{(i+1)}(s) = \sum_t \hat k(s,t) / q(s)$ for each s.

The two steps combine to give the simple updating formula

$$\lambda^{(i+1)}(s) = \frac{\lambda^{(i)}(s)}{q(s)} \sum_t \frac{n(t)\, p(s,t)}{\sum_r \lambda^{(i)}(r)\, p(r,t)} \qquad (2.2)$$

for s = 1,...,S. An even simpler interpretation of EM is possible in this case, most
obviously if we treat m rather than k as the complete data: estimate m by its current
expectation, $\hat m^{(i)}$, given $\lambda^{(i)}$ and n, use $\hat m^{(i)}$ as the "new
data", then iterate.
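The combined E and M steps amount to a single multiplicative update. A minimal sketch follows, assuming the kernel is stored as a list of lists `p[s][t]` with every q(s) positive; the name `em_update` and the convention of skipping data bins whose fitted mean is zero are choices made for this illustration.

```python
def em_update(lam, n, p):
    """One EM iteration for the Poisson model n(t) ~ Poisson(sum_s lam[s] * p[s][t]).

    lam : current estimate, length S
    n   : observed counts, length T
    p   : discretised kernel, S rows of length T
    """
    S, T = len(p), len(p[0])
    q = [sum(p[s]) for s in range(S)]
    # Fitted mean of each data bin under the current estimate.
    fitted = [sum(lam[r] * p[r][t] for r in range(S)) for t in range(T)]
    new_lam = []
    for s in range(S):
        # Share of each observed count attributed to bin s (E step),
        # summed over t and divided by q(s) (M step).
        total = sum(n[t] * p[s][t] / fitted[t] for t in range(T) if fitted[t] > 0)
        new_lam.append(lam[s] * total / q[s])
    return new_lam
```

Because the update multiplies the current value by a non-negative factor, a non-negative starting estimate stays non-negative, and the updated values satisfy $\sum_s q(s)\lambda(s) = \sum_t n(t)$ whenever all fitted means are positive.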

As well as its conceptual simplicity, the EM algorithm has other apparently
appealing properties. First, it necessarily increases the log likelihood at each iteration
(Dempster et al., 1977) and, since the log likelihood at (2.1) is concave, convergence
of the algorithm is guaranteed in theory. There is not usually, however, a unique ML
solution (certainly not when S > T), so the EM converges to one of the
reconstructions maximising the likelihood, that one depending on the choice of the
initial values $\lambda^{(0)}$. A second consideration is that each $\lambda^{(i)}(s)$ is
automatically non-negative provided the initial image is. Taking account of this
non-negativity constraint can be important; see, for example, Bertero and Dovi (1981).
In methods other than EM, this constraint needs to be incorporated at considerable cost
in computational complexity or else ignored with detrimental repercussions for quality
of reconstruction. Thirdly, we note, as do Vardi et al. (1985), that the EM updating
formula (2.2) also arises directly from the likelihood at (2.1) as an iterative solution
to the Kuhn-Tucker conditions for a maximum. The EM is just one possible optimisation
algorithm for this problem and the question arises whether there are advantages to be
had using an alternative optimisation technique. Kaufman (1987) investigates this in
the PET context. Although it is possible to accelerate the optimisation in its early
stages, the EM proves to be a sensible approach to the computation of ML estimates.
Kaufman (1987) argues that it can be thought of as a "preconditioned" steepest ascent
method, having properties similar to steepest ascent in many situations and
considerably improving on it in others.

Vardi  et  al.  (1985),  however,  found  that  in  practice  the  convergence  to  an  ML 
estimate  is  exceedingly  slow.  Furthermore,  as  the  iterations  proceeded  beyond  a  certain 
point,  the  quality  of  the  reconstructions  actually  deteriorated,  and  we  shall  see,  in  the 
computationally simpler problem of Section 3, that an ML estimate itself is
unsatisfactory. Their proposed solution was to start with a uniform image and to abandon any
attempt  to  iterate  to  convergence;  instead,  they  terminate  the  process  after  a  chosen 
number  of  steps  (probably  a  long  way  from  convergence).  In  this  way,  Vardi  et  al. 
(1985)  obtained  pleasing  reconstructions  for  the  PET  problem. 


We  feel  it  is  philosophically  more  satisfactory  to  abandon  the  aim  of  finding  ML 
estimates altogether and to replace the technique just described by an explicit
smoothing procedure. Also, we seek estimates that are the realisable limits of an algorithm
that  actually  converges  in  a  reasonably  small  number  of  iterations  and  that  yields 
results  independent  of  the  starting  configuration.  We  must  stress,  however,  that  we 
wish  to  build  on,  and  not  to  disparage,  the  very  important  work  of  Vardi  et  al.  (1985), 
without  which  the  present  paper  would  not  have  been  possible.  Indeed,  Vardi  et  al. 
themselves  suggested  that  some  smoothing  might  improve  PET  reconstructions. 

2.2.1  The EMp Algorithm

In order to provide smoother estimates of $\lambda$ than those given by ML, an appealing
approach is regularisation or penalised maximum likelihood (see, for example,
Silverman, 1985b, and Titterington, 1985b): instead of maximising $l(n \mid \lambda)$,
maximise

$$l(n \mid \lambda) - R(\lambda). \qquad (2.3)$$

Here, $-R(\lambda)$ might be interpreted as a log prior density for $\lambda$ in a Bayesian
framework or as a penalty term which discourages roughness in a penalised likelihood
approach. Choosing $\lambda$ to maximise (2.3) can in principle be achieved by EM methods
too, as noted by Dempster et al. (1977), to give, say, an EMp algorithm:

E  STEP  as  for  EM, 

Mp  STEP 

find $\lambda^{(i+1)}$ by maximising

$$\sum_s \sum_t \left[ \hat k(s,t) \log\{\lambda(s)\, p(s,t)\} - \lambda(s)\, p(s,t) \right] - R(\lambda).$$

Repeating E and Mp steps affords convergence to a maximum penalised likelihood
solution as required; what is more, convergence can be expected to be rather quicker
than for the basic EM.

Computational  considerations,  however,  militate  against  performing  the  Mp  step 
at  each  iteration  of  an  EMp  algorithm;  comparison  with  the  trivial  M  step  of  EM  for 
Poisson  likelihoods  emphasises  the  extra  burden.  The  Mp  step  involves  a  full  penalised 
likelihood  reconstruction  for  the  case  where  the  data  depend  on  the  intensity  function 
of interest through a Poisson likelihood. In any context where $\lambda$ is a pixel image,
the important work of Greig, Porteous and Seheult (1986) casts doubt on the existence of
any method at present for actually achieving the penalised likelihood solution, although
of course it would be an interesting avenue for investigation to apply the image
processing methods of Besag (1986) in an Mp step.


2.3  The EMS Algorithm

Our  proposed  approach  is  slightly  ad  hoc,  but  is  very  straightforward.  We  suggest 
introducing  a  further  step  which  smooths  the  result  of  E  and  M  steps  in  a  simple  way. 
This  gives  an  EMS  algorithm: 

E STEP as for EM,

M STEP as for EM, except call the output $\tilde\lambda^{(i+1)}$, say,

S STEP

smooth $\tilde\lambda^{(i+1)}$ to give $\lambda^{(i+1)}$.

For the problems of interest in this paper, this becomes the iteration: update by (2.2)
and smooth. (If, as in the PET application of Section 4, the reconstruction bin sizes
{a(s)} differ, apply the smoother to the $\tilde\lambda^{(i+1)}(s)/a(s)$ values, then
multiply the resulting values by the corresponding a's to get $\lambda^{(i+1)}$.) This EMS
approach is the major tool used throughout the rest of the paper.
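The iteration "update by (2.2) and smooth", including the adjustment for unequal reconstruction bin sizes, can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used for the results reported here: the names `em_update`, `s_step`, `ems` and `three_point` are hypothetical, the smoother is passed in as an arbitrary function, and the three-point smoother with repeated end values is just one simple choice.

```python
def em_update(lam, n, p):
    """EM update (2.2); data bins with zero fitted mean are skipped."""
    S, T = len(p), len(p[0])
    fitted = [sum(lam[r] * p[r][t] for r in range(S)) for t in range(T)]
    return [
        lam[s]
        * sum(n[t] * p[s][t] / fitted[t] for t in range(T) if fitted[t] > 0)
        / sum(p[s])
        for s in range(S)
    ]


def s_step(lam_tilde, sizes, smooth):
    """Smooth the per-unit-size values, then rescale by the bin sizes a(s)."""
    densities = smooth([v / a for v, a in zip(lam_tilde, sizes)])
    return [d * a for d, a in zip(densities, sizes)]


def ems(n, p, sizes, smooth, lam0, iterations=50):
    """Run a fixed number of E, M and S steps from the starting values lam0."""
    lam = list(lam0)
    for _ in range(iterations):
        lam = s_step(em_update(lam, n, p), sizes, smooth)
    return lam


def three_point(values):
    """Hypothetical smoother: weights (1/4, 1/2, 1/4), end values repeated."""
    last = len(values) - 1
    return [
        0.25 * values[max(s - 1, 0)] + 0.5 * values[s] + 0.25 * values[min(s + 1, last)]
        for s in range(len(values))
    ]
```

With equal bin sizes, `sizes` can be a list of ones and the S step reduces to applying the smoother directly. Any smoother with non-negative weights keeps the estimate non-negative.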

Choice of appropriate smoothing method is problem dependent and will be
considered in Sections 3 and 4, although it turns out that similar methods in both
contexts prove useful even though the perception of what constitutes "smooth" is
somewhat different in the two problems. Sensible smoothing schemes should retain
automatic non-negativity. We no longer have an appealing direct interpretation of a
reconstruction obtained by EMS in terms of the solution of a specified optimisation
problem, although the work of Section 5 yields a heuristic relationship with such an
approach. This backs up our empirical evidence, which suggests that sensible smoothing
regimes allow the EMS algorithm to converge, and to do so at an increased rate compared
with EM, due to the smoothing. Moreover, it seems from our empirical experience that we
can expect convergence to a unique solution.

3.  A  FIRST  APPLICATION  :  STEREOLOGY 

3.1  The Problem

A classical problem in stereology is the following. A three-dimensional specimen
consists of some translucent material in which are situated a number of opaque
non-overlapping spheres. Interest centres on the size distribution of these spheres; in
an example considered later, they represent tumours in the liver of a mouse. It is not
possible to observe the three-dimensional internal structure directly. Rather, a thin
slice is taken through the specimen at some random orientation. When this section is
examined, usually under a microscope, a number of circles is observed, each
corresponding to a slice through one of those spheres which happen to be cut by the
section. Our aim is to recover the intensity of the radii of the spheres in the medium
from this sample of circle radii.

We make the standard assumption (that cannot be more than approximately true)
that the centres of the spheres are distributed according to a three-dimensional Poisson
process with constant intensity. The sphere radii are bounded above by Y, say, a
constant determined by the practical context. A further practical constraint introduces
a lower bound $\varepsilon$, say; circle radii smaller than $\varepsilon$ cannot physically
be observed. Thus, we are concerned with a truncated circle radius intensity f(x), with
x in the interval $[\varepsilon, Y]$, and seek to reconstruct a similarly truncated version
g(y) of the sphere radius intensity, with y in the same interval (so that the spaces X
and Y of Section 1 coincide). Both $\varepsilon$ and Y are assumed known. The relationship
between f and g can be written in a form directly comparable with (1.1):

$$f(x) = c_\varepsilon \int_\varepsilon^{Y} \frac{I_{[x,Y]}(y)\, x}{(y^2 - x^2)^{1/2}}\, g(y)\, dy. \qquad (3.1)$$

Here, $I_\Phi(y)$ is the indicator function (1 if $y \in \Phi$, 0 otherwise) and
$c_\varepsilon$ is a constant. This equation was first derived by Wicksell (1925) for the
case $\varepsilon = 0$ and extended to $\varepsilon \ne 0$ in many subsequent papers (see
Cruz-Orive, 1983). The ill-posedness of the problem defined by the kernel function in
(3.1) arises, intuitively, because circles of a given radius can be obtained from
sections through spheres of any radius larger than that observed.

Discretisation of (3.1) proceeds exactly as set out in Section 2. All quantities
defined there transfer directly to the stereology context and we retain the same notation
in this section; s-bin quantities are now to do with sphere radii, t-bins with circle
radii. In the real data example treated briefly in Section 3.4, circle radii were,
indeed, recorded in binned form only; these bins and our reconstruction bins are all of
equal width. The form of the kernel in (3.1) allows exact computation of the p(s,t)'s
in a straightforward manner.
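One way in which the exact computation can go is sketched below, assuming the Wicksell kernel $K(x,y) = x\,(y^2 - x^2)^{-1/2}$ for $x \le y$ (zero otherwise) with the constant set to 1; both the function names and the dropping of the constant are assumptions of this illustration. Integrating over x in a circle-radius bin $[x_0, x_1]$ gives $(y^2 - x_0^2)_+^{1/2} - (y^2 - x_1^2)_+^{1/2}$, and each square root has the antiderivative $\tfrac12\{y (y^2 - a^2)^{1/2} - a^2 \log(y + (y^2 - a^2)^{1/2})\}$ in y.

```python
import math


def _sqrt_integral(a, y):
    """Antiderivative of sqrt(y^2 - a^2) in y, held constant for y < a."""
    y = max(y, a)
    root = math.sqrt(y * y - a * a)
    return 0.5 * (y * root - a * a * math.log(y + root))


def wicksell_p(x_lo, x_hi, y_lo, y_hi):
    """p(s,t) for circle-radius bin [x_lo, x_hi] and sphere-radius bin [y_lo, y_hi].

    Integral of K(x, y) over both bins divided by the width of the
    sphere-radius bin, with the multiplicative constant taken as 1.
    Requires x_lo > 0 so that the logarithm is defined.
    """
    inner_lo = _sqrt_integral(x_lo, y_hi) - _sqrt_integral(x_lo, y_lo)
    inner_hi = _sqrt_integral(x_hi, y_hi) - _sqrt_integral(x_hi, y_lo)
    return (inner_lo - inner_hi) / (y_hi - y_lo)
```

Entries are zero whenever the sphere-radius bin lies entirely below the circle-radius bin, reflecting the fact that a sphere cannot produce a circle larger than itself.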

Alternative approaches to estimating g in (3.1) are reviewed by Cruz-Orive (1983)
and Nychka et al. (1984). Cruz-Orive (1983) also discusses some other practical
difficulties which, for simplicity, we have omitted. Few previously proposed solutions
have been statistical in nature and, of these, very few have resulted from a
nonparametric approach. A notable exception is the method proposed by Nychka et al.
(1984), which is discussed in Section 3.3; this paper inspired much of the simulation
and practical work reported here. Further details of the application of the EMS
algorithm to the stereology problem and more empirical evidence are reported in Wilson
(1987).


3.2  EM and EMS Reconstructions for Simulated Data

In this section, we apply EM and EMS algorithms to simulated data from the
stereology problem. Following Nychka et al. (1984), we chose $\varepsilon = 0.04$ and
Y = 0.4 and considered two particular choices of g: appropriately truncated versions of
a Weibull density, $g(y) = \alpha\beta y^{\beta-1} \exp(-\alpha y^{\beta})$, with parameters
$\alpha = 12$ and $\beta = 0.1$, and of a mixture of two normals, one with mean 0.15
chosen with probability 0.7, the other with mean 0.275 and both with standard deviation
0.03. This scaling (in millimeter units) and the Weibull density were chosen to imitate
theoretically the real data situation described in Section 3.4; this is Nychka et al.'s
"Experiment 1". The bimodal normal mixture follows "Experiment 3" of Nychka et al. and
was chosen to test the ability of the reconstruction methods to recover distinct peaks
in an intensity. It is not difficult to generate data (from f) by mimicking the physical
process: choose a candidate sphere radius from the distribution with density g, decide
whether this sphere was cut by a random section using an acceptance/rejection technique
(resulting in a sphere radius from the length-biased distribution corresponding to g)
and then determine the corresponding circle radius by slicing the sphere at a uniformly
distributed distance from its centre. For further details, see Wilson (1987). Again to
be roughly comparable with the work of Nychka et al. (1984), we generated an average of
190 circle radii in each simulation of the Weibull case and 330 for the normal mixture.
These data were grouped into T = 50 bins.
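The data-generation recipe just described can be sketched as below. This is a hedged illustration, not the procedure of Wilson (1987): the function names are hypothetical, the sampler `draw_sphere_radius` for the truncated density g is supplied by the caller and is assumed to return values in (0, upper], and acceptance with probability R/upper is one way of realising the length-biased selection.

```python
import math
import random


def simulate_circle_radii(draw_sphere_radius, n_candidates, eps, upper, rng):
    """Generate observable circle radii by mimicking the sectioning process."""
    radii = []
    for _ in range(n_candidates):
        R = draw_sphere_radius(rng)        # candidate sphere radius from g
        if rng.random() >= R / upper:      # sphere is cut with probability proportional to R
            continue
        d = rng.uniform(0.0, R)            # distance of the section from the sphere centre
        x = math.sqrt(R * R - d * d)       # radius of the resulting circle
        if x >= eps:                       # smaller circles cannot be observed
            radii.append(x)
    return radii


def bin_counts(radii, eps, upper, T):
    """Group circle radii into T equal-width bins on [eps, upper]."""
    width = (upper - eps) / T
    counts = [0] * T
    for x in radii:
        counts[min(int((x - eps) / width), T - 1)] += 1
    return counts
```

Because candidates are rejected at two stages, the number of circle radii obtained is random; the number of candidates would be tuned to give the desired average sample size.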


3.2.1  EM Reconstructions

Typical ML estimates, using S = 50 reconstruction bins, are shown in Figs 3.1 and
3.2 for the Weibull and normal mixture cases, respectively. In these and all remaining
figures in Section 3, g is represented by a broken line and the estimate of g by a solid
line. The spiky nature of these EM reconstructions has already been alluded to; Figs
3.1 and 3.2 are genuinely representative of the kind of reconstruction always preferred
by ML and are clearly unacceptable as estimates of g. Incidentally, in this instance
early termination of the EM algorithm, even though started from a uniform initial
allocation, is not an effective remedy.


3.2.2  Local Smoothing

Smoothness of an intensity function can be defined in a number of ways. For
current purposes, however, a heuristic notion of smooth as (in binned form) values in
neighbouring bins "differing little" will suffice. We propose using a very simple
smoother to have this effect; we claim no "optimality" properties for our choice, but
appeal to its practical effectiveness and simplicity as justification for its use. The
scheme is a weighted average of a bin value and the values of its nearest neighbours,
using binomial weighting factors. Recalling the notation used in the definition of the
EMS algorithm in Section 2, with $\tilde\lambda^{(i+1)}$ denoting the output of the M
step, we set

$$\lambda^{(i+1)}(s) = \sum_{r=-j}^{j} 2^{-2j} \binom{2j}{j+r}\, \tilde\lambda^{(i+1)}(s+r).$$

Typically, such J-point smoothing, where J = 2j+1, is used for J = 3, 5, 7, 9 or perhaps
11; the greater J is, the more smoothing is applied. Various modifications are possible
at the ends of the range of bins; Wilson (1987) describes the one used here.
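A J-point smoother with binomial weights can be sketched as follows. The weights $2^{-2j}\binom{2j}{j+r}$, r = -j,...,j, are the standard binomial choice for J = 2j+1 points; the treatment of the ends of the range (repeating the first and last bin values) is an assumption made purely for this illustration and is not claimed to be the modification described by Wilson (1987).

```python
from math import comb


def binomial_weights(J):
    """Binomial weights for J-point smoothing, J = 2j + 1 odd."""
    if J < 1 or J % 2 == 0:
        raise ValueError("J must be a positive odd integer")
    j = (J - 1) // 2
    return [comb(2 * j, j + r) / 4 ** j for r in range(-j, j + 1)]


def binomial_smooth(values, J):
    """Weighted average of each bin with its 2j nearest neighbours.

    Bins beyond either end are replaced by the nearest end value
    (an assumed edge convention).
    """
    j = (J - 1) // 2
    weights = binomial_weights(J)
    last = len(values) - 1
    return [
        sum(weights[r + j] * values[min(max(s + r, 0), last)] for r in range(-j, j + 1))
        for s in range(len(values))
    ]
```

For example, `binomial_weights(3)` gives the familiar weights 1/4, 1/2, 1/4. Since all weights are non-negative and sum to one, the smoother preserves non-negativity and leaves a constant sequence unchanged.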

3.2.3  EMS Reconstructions

In this form, EMS maintains the EM property of automatically scaling successive
estimates so that $\sum_s \lambda^{(i)}(s) = N$ for $i \ge 1$, where $N = \sum_t n(t)$ is
the total number of circles in the section. We then normalise and join the estimated
values at bin midpoints by straight lines to obtain a frequency polygon, calling this
$\hat g$; it is the density estimate displayed in the figures.

Applying the EMS algorithm with 5-point smoothing to the normal mixture data
which gave rise to Fig. 3.2 produces Fig. 3.3. The improvement in quality of
reconstruction with the introduction of smoothing is striking. Indeed, this EMS
reconstruction provides an excellent estimate of g.

Not all EMS estimates, however, provide quite such good reconstructions. To
measure the discrepancy between $\hat g$ and g, we essentially use the $L_1$ distance
$\int_\varepsilon^{Y} |\hat g(y) - g(y)|\, dy$, approximated by

$$\Delta = b \sum_{s=1}^{S} |\hat g_s - g_s|,$$

where $\hat g_s$ and $g_s$ are the values of $\hat g$ and g at the midpoints of the
s-bins, and $b = S^{-1}(Y - \varepsilon)$ is the bin width. In all, ten different datasets
(of essentially the same size) were generated from the normal mixture model and EMS
reconstructions (with J = 5) performed. According to $\Delta$, the $\hat g$ of Fig. 3.3 is
the second best of the ten ($\Delta = 0.1251$), the best having $\Delta = 0.1198$ and the
worst corresponding to $\Delta = 0.3243$. This worst reconstruction is shown in Fig. 3.4.
One striking feature in this picture is the poor behaviour of $\hat g$ near $\varepsilon$.
This effect was observed in a minority of cases and appears to be due to an inherent
numerical and statistical instability, possibly connected with the lack of information
at small circle radii. Nychka et al. (1984) noted the same phenomenon; Wilson (1987)
shows that the difficulty sometimes disappears if the data are re-binned. The other
disappointing aspect of this $\hat g$ is its behaviour where there is a trough in g;
having said that, there is certainly still some hint of the underlying bimodality or, at
least, a strong indication that g is not unimodal. Most of the ten simulated datasets
resulted in rather better estimates of g, however.
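The discrepancy measure is a simple Riemann-type approximation to the $L_1$ distance: the bin width times the sum of absolute differences at bin midpoints. A minimal sketch, in which the function name and argument names are assumptions made for illustration:

```python
def l1_discrepancy(g_hat_mid, g_mid, eps, upper):
    """Approximate the L1 distance between an estimate and the true density.

    g_hat_mid, g_mid : values at the midpoints of the S reconstruction bins
    eps, upper       : end points of the interval covered by the bins
    """
    S = len(g_mid)
    b = (upper - eps) / S  # common bin width
    return b * sum(abs(gh - g) for gh, g in zip(g_hat_mid, g_mid))
```

Since both arguments are densities on the same interval, the measure lies between 0 (perfect agreement at the midpoints) and roughly 2 (no overlap).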

A further important advantage of the EMS algorithm over basic EM is also well
illustrated by these simulations, namely, an enormous improvement in the computer
time taken to reach the solution. Using the convergence criterion "stop as soon as
$\|\lambda^{(i+1)} - \lambda^{(i)}\|^2 < 10^{-6}\, \|\lambda^{(i)}\|^2$", the EM
reconstruction of Fig. 3.2 took 484 iterations to complete; the EMS reconstructions of
Figs 3.3 and 3.4 took just 39 and 29 iterations, respectively. Since the binomial
smoothing step adds only a very small extra computational burden to each iteration,
these savings are impressive. Uniqueness of EMS reconstructions also seems to hold: in
experiments with different starting configurations, we have never obtained more than a
single solution per dataset.

We have not considered automatic choice of the smoothing parameter J; rather, a
more subjective approach has been found to work well. Reconstructions using J = 3,
then J = 5, 7 etc., can be looked at in turn, the process stopping when major features
in the estimate start to disappear. In practice, only a very few (at most 4 or 5) such
reconstructions need to be calculated; that even this is not computationally
over-demanding follows from the excellent convergence rates discussed above.

An EMS reconstruction in the Weibull case is shown in Fig. 3.5; 9-point
smoothing turned out to be suitable here. Fig. 3.5 is based on the same dataset as the
EM reconstruction of Fig. 3.1; the vast improvement brought about by the smoothing is
again impressive. Ten datasets were simulated in this case, too; Fig. 3.5, with
$\Delta = 0.1947$, is only the seventh best estimate of these, thus demonstrating that a
good correspondence between true and estimated densities is quite typical of our Weibull
reconstructions. Even in the worst cases, the estimate of the density's tail is
pleasingly accurate and the reconstructions always indicate an increase in density near
$\varepsilon$; serious discrepancies arise only in the estimate of the magnitude of this
effect. The EM reconstruction of Fig. 3.1 took 328 iterations to arrive at; typically,
EMS reconstructions (here with a greater degree of smoothing than in the normal mixture
case) took fewer than 20 iterations each to converge.

3.3  Remarks on Nychka et al. (1984)

The reconstructions of Section 3.2.3 can be compared with those of Figs 7 and 9
of Nychka et al. (1984). The immediate impression is of a broad similarity of the
results of the two approaches; that our reconstructions are certainly no worse than
those of Nychka et al. is important, since we believe that the EMS approach of this
paper has several advantages over the cross-validated spline approach of Nychka et al.
(1984). Nychka et al. take a regression approach to what is a density estimation version
of the integral equation problem. This is done by treating the data histogram values as
if they were values of the intensity function f observed with error. Some justification
for this is to appeal to the asymptotic result that the "error terms" will have zero
mean, be jointly normal and weakly correlated; the latter correlation and unequal error
variances were then ignored. The usual penalised least squares approach to such problems
could then be applied with the value of the smoothing parameter involved chosen
automatically by the well-known method of generalised cross-validation (see Nychka et
al., 1984, for references). The advantages we perceive for our EMS algorithm over
Nychka et al.'s approach are its computational simplicity and speed, its more natural
incorporation of the non-negativity constraint, and the fact that it attacks the Poisson
likelihood directly.


3.4  A  Real  Data  Example 

The result of applying the EMS algorithm (with J = 5) to some mouse liver data
considered by Nychka et al. (1984) is given in Fig. 3.6. This reconstruction arises from
a section through the liver of a mouse in which there are a number of malignant
microtumours induced by injection of a carcinogen. A total of 154 tumour cross-sections
were observed; we took $\varepsilon = 0.038$, Y = 0.51 (although the plot stops at 0.4;
beyond this, $\hat g = 0$), T = 64 and S = 50. Fig. 3.6 is directly comparable with
Fig. 6 of Nychka et al. (1984). The outstanding feature of this comparison is, once
again, a remarkable similarity in reconstructions obtained by the two approaches. Of
the two, we have perhaps preferred a little less smoothing; any minor differences can be
largely attributed to this.

This particular mouse liver was, in fact, completely dissected and the histogram
of sphere radii found is also shown on Fig. 6 of Nychka et al. (1984). In one sense,
this forms a true distribution; comparing the reconstruction with the histogram reveals
a generally good agreement, except for discrepancies in the magnitude and slope of the
density near $\varepsilon$. However, this comparison is not entirely fair: we have been
estimating a (presumed) smooth density of malignant tumours in mouse livers, of which
the histogram is itself another (unsmooth) estimate, albeit based on a much larger
sample of directly observed spheres.

4.  A  SECOND  APPLICATION  :  POSITRON  EMISSION  TOMOGRAPHY 

4.1  The  Problem 

PET is a medical diagnostic technique that studies the pattern of blood flow and
metabolic activity in an organ by producing an indirectly observed image of a planar
section through the patient's body. Such pictorial representations of internal structure
have considerable appeal as a means of diagnosing certain diseases and in assessing
the effectiveness of treatments. Some of the material of this section is a general
review of the problem, but we shall suggest several technical improvements on existing
methodology in addition to the use of our smoothed EM procedure. One particular
advance is a new discretisation of the "body space" Y which affords considerable
computational economies; see Section 4.1.2.

PET operates as follows. A radioactive tracer (here, a substance, often glucose,
emitting positrons) is introduced into the area of interest and the occurrence
of  these  emissions  is  recorded  by  an  array  of  detectors  arranged  around  the  body;  this 
apparatus  is  a  tomograph.  The  amount  of  radiation  given  off  at  any  point  reflects  the 
degree  of  activity  present  there,  so  the  overall  "emission  density"  provides  the  required 
portrait  of  internal  structure  which  we  estimate.  The  physics  of  PET  is  described  by 
Vardi  et  al.  (1985)  thus:  "When  a  positron  is  emitted,  it  ‘finds’  a  nearby  electron  and 
annihilates  with  it.  The  annihilation  creates  two  X-ray  photons  that  fly  off  the  point  of 
annihilation,  at  the  speed  of  light,  in  (nearly)  opposite  directions  along  a  line  with  a 
completely  random  (i.e.  uniformly  distributed  in  space)  orientation.  There  is  an  array 
of  discrete  detector  elements  surrounding  the  [area  of  interest],  and  the  two  photons 
are detected in coincidence by a pair of detector elements that define ... a tube, thus
the  only  information  acquired  when  a  pair  of  detectors  count  a  coincidence  is  that  the 
annihilation  occurred  somewhere  inside  the  tube  defined  by  the  two  ‘firing’  detectors". 
This  is  illustrated  in  Fig.  4.1;  see  also  Fig.  1  of  Vardi  et  al.  (1985)  or  Kaufman  (1987). 
Fig. 4.1 is a planar view. It is important, however, to bear in mind the
three-dimensional nature of the emission process and, consequently, the finite "depth" of the
detectors; the effect of this (not considered by Vardi et al., 1985) is discussed in
Section 4.1.4. The tube counts comprise the data n. Note that the tube space X differs
from  the  body  space  Y. 

For  more  details  on  physical  aspects  of  PET,  see  Vardi  et  al.  (1985)  and  Hoffman 
and  Phelps  (1986).  PET  is  a  fairly  recent  innovation,  many  aspects  of  which  are  still 
at  the  development  stage.  Research  interest  in  PET  covers  several  disciplines;  see 
Phelps,  Mazziota  and  Schelbert  (1986)  for  an  up-to-date  account,  including  an  idea  of 
the scope of medical application. Other kinds of tomography exist. Transmission
tomography has had more impact: X-ray transmission tomography and related
techniques are well-known, but are mathematically quite distinct from PET so the methods
discussed  here  do  not  apply.  Our  methods  can  be  modified  for  use,  however,  with 
another  emission  technique  called  single  photon  emission  tomography  (SPECT)  which 
is  little  different,  as  far  as  mathematical  or  statistical  analysis  goes,  from  PET;  see,  for 
example,  Geman  and  McClure  (1985,  1987). 

4.1.1  The  Detectors 

There  are  a  number  of  detector  configurations  in  current  use  at  PET  installations. 
We  follow  Vardi  et  al.  (1985)  in  considering  a  single  stationary  circle  of  D  detectors, 
each of equal size and with no gaps between them. Without loss of generality, the
circle has unit radius. This is shown in Fig. 4.1; as there, we take D = 128 in all our
illustrations (D is often a power of 2). Practical variations on this set-up include
alternative detector array shapes, gaps between detectors, two or more such arrays and
movement of detectors. All T = D(D-1)/2 pairs of detectors form the tubes or data
bins. Although the PET problem is a bivariate (spatial) analogue of the univariate
application of Section 3, it will remain convenient to index tubes by t = 1,...,T in
what  follows;  we  note  that  for  computational  purposes,  however,  the  spatial  location  of 
detector  tubes  is  best  described  by  a  polar  coordinate  system. 
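
As a small illustration of the tube indexing just described, the following Python sketch enumerates detector pairs and confirms the count T = D(D-1)/2. The function name `enumerate_tubes` and the zero-based detector labels are assumptions made for the example, not part of the original work.

```python
from itertools import combinations

def enumerate_tubes(num_detectors):
    """Return all unordered detector pairs (i, j) with i < j.

    The position of a pair in the list (plus one) can serve as the
    tube index t = 1,...,T. Detector labels 0..D-1 are an assumption.
    """
    return list(combinations(range(num_detectors), 2))

tubes = enumerate_tubes(128)
print(len(tubes))  # T = 128 * 127 / 2
```

For D = 128 this gives T = 8,128 tubes.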

4.1.2  Discretising  the  Disc

The space Y is the disc enclosed by the circle of detectors. We require a
discretisation of this disc on which to reconstruct and display emission densities; discretised
functions  are  piecewise  constant  taking  a  single  value  across  each  bin  or  pixel.  Many 
workers,  including  Vardi  et  al.  (1985),  simply  superimpose  a  square  grid  of  pixels  over 
Y,  but  this  approach  suffers  from  important  computational  disadvantages  compared 
with  discretisations  that  better  take  into  account  the  geometry  of  the  situation.  By 
more  properly  exploiting  circular  symmetries,  it  is  possible  to  make  substantial  savings 
in  both  storage  and  time  requirements.  In  order  best  to  represent  an  image  by  a  step 
function,  all  pixels  should  be,  at  least  approximately,  of  equal  area  and  shape. 

Suppose we allow D_1 = 2^k divisions of the detector circle into arcs of equal
length, for some integer k. Then, our proposal is to use the discretisation shown in
Fig. 4.2, constructed as follows. First divide the disc into R = D_1/4 equal-width rings
by drawing circles of radius i/R, i = 1,2,...,R; for each i, set j = [log_2 i], where [x]
denotes the largest integer strictly less than x, and divide ring i into 2^(j+3) pixels of
equal size and shape. Thus the pixellation is achieved by doubling segmentations of
the circle at appropriate stages, at the expense of introducing "seams" between the (2^j)th
and (2^j + 1)st rings, j = 0,1,...,k-3. Except for the innermost ring of all, each pixel is
of the same general shape, while the ratio of maximum to minimum pixel area is
strictly less than 2. For D = 128, the choice D_1 = D yields what we regard to be too
coarse a grid. Rather, we employ D_1 = 2D pixels in the outermost ring and identify
pairs of adjacent pixels with detectors. In this application, s = 1,...,S refer to these
pixels.
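
The construction above can be checked numerically. The following Python sketch computes the number of pixels in each ring and the corresponding pixel areas for the unit disc; the function names are illustrative assumptions, and ring i is taken to lie between radii (i-1)/R and i/R.

```python
import math

def ring_pixel_counts(d1):
    """Number of pixels in rings i = 1..R, where R = d1 / 4 and d1 = 2**k."""
    num_rings = d1 // 4
    counts = []
    for i in range(1, num_rings + 1):
        # j is the largest integer strictly less than log2(i);
        # (i - 1).bit_length() - 1 gives this exactly, including j = -1 for i = 1.
        j = (i - 1).bit_length() - 1
        counts.append(2 ** (j + 3))
    return counts

def pixel_areas(d1):
    """Area of a single pixel in each ring of the unit disc."""
    num_rings = d1 // 4
    counts = ring_pixel_counts(d1)
    # Ring i has area pi * (2i - 1) / R**2, shared equally among its pixels.
    return [math.pi * (2 * i - 1) / num_rings ** 2 / n
            for i, n in enumerate(counts, start=1)]

counts = ring_pixel_counts(256)
areas = pixel_areas(256)
print(sum(counts), counts[-1], max(areas) / min(areas))
```

With D_1 = 256 the sketch gives R = 64 rings, 256 pixels in the outermost ring and 10,924 pixels in total, with a maximum-to-minimum area ratio just below 2, in agreement with the figures quoted in the text.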

Kearfott  (1985)  and  Kaufman  (1987)  also  recognise  the  advantages  to  be  gained 
by using such a "ring grid". Kearfott’s (1985) simple discretisation of the disc results
in the division of the central area into very many long thin pixels, to the obvious
detriment of discretised picture quality. Kaufman (1987) presents a discretisation which
overcomes this problem. Kaufman’s ring grid is, however, rather less easy to describe
than is ours: "The ith ring is divided into n_i sectors so that n_i = j_i × k_i where j_i is a
small integer and k_i is a divisor of D," but there appears to be no simple scheme for
choice of these values. Further, Kaufman (1987) uses variable ring widths (although
the widths "vary no more than about 25 percent") to obtain pixels of equal area; it is
not clear that this complication is worthwhile when pixels vary correspondingly more
in shape. A full description of Kaufman’s pixellation of the disc thus requires
specifying values of j, k and width individually for each ring. For D = 128, Kaufman’s
(1987)  discretisation  results  in  12,884  pixels,  while  the  comparable  grid  in  Fig.  4.2  has 
rather  fewer  —  10,924;  each  has  R  =  64.  The  value  of  these  circularly  symmetric 
discretisations  for  computational  purposes  is  best  reflected  in  the  number  P  of  different 
possible  relationships  of  pixels  to  tubes,  modulo  rotations.  For  Kaufman’s  (1987)  grid, 
P  =  200;  for  ours,  P  =  R  =  64  —  just  one  per  ring.  These  numbers  compare  with 
P  =  2,080  for  a  comparable  square  grid  discussed  by  Kaufman  (1987). 

4.1.3  More  on  the  Problem

We  have  no  real  data  from  a  PET  installation,  but  rather  seek  to  reconstruct  a 
relevant  mathematical  model  (or  mathematical  phantom)  of  an  image  using  simulated 
data.  The  phantom  we  use  is  (essentially)  that  of  Vardi  et  al.  (1985);  in  Fig.  4.3,  we 
present  a  grey-level  picture  of  that  phantom,  using  64  grey-levels  to  reflect  emission 
intensity  in  the  obvious  way.  This  image  is  designed  "as  a  simplified  imitation  of  the 
brain’s  metabolic  activity"  with  different  areas  representing  the  skull,  grey  matter, 
tumours  and  so  on.  Note  that  a  property  of  this  picture  is  that  the  emission  density 
consists  of  areas  of  constant  intensity  with  fairly  large  contrast  between  different  areas. 
Fig.  4.3  is,  of  course,  a  discretised  version  of  the  ideal  image  (Fig.  2  of  Vardi  et  al., 
1985),  pixels  overlapping  area  boundaries  being  regarded  as  having  a  weighted  average 
of values present, in (approximate) proportion to area of pixel covered. Note that we
actually aim to reconstruct this discretised emission density, and denote total emissions
in pixels by λ(s), s = 1,...,S. Also, differing pixel areas must be taken into account in
the smoothing and plotting; the intensities we plot are φ_s = λ(s)/a(s), where a(s) is
the area of pixel s.

Positron  emissions  are  assumed  to  occur  uniformly  at  random  over  homogeneous 
regions, but at appropriately differing rates between areas of dissimilar material, i.e.
they occur according to a nonhomogeneous spatial Poisson point process with intensity
function the emission density. The unobserved pixel counts are m and k(s,t) is the
number  of  emissions  occurring  in  pixel  s  which  are  detected  in  tube  t.  Vardi  et  al. 
(1985)  state  that  the  Poisson  process  assumption  "seems  beyond  challenge  and  requires 
no  justification"  in  the  PET  problem. 

The discretised kernel function becomes the probability that a uniformly
orientated line through y in pixel s intersects the two detectors defining the tube t, averaged
over all y in pixel s, for s = 1,...,S, t = 1,...,T. The geometrical problem of evaluating
the p’s exactly is non-trivial; we propose using a simple approximation. If the
problem were the strictly two-dimensional one suggested by Fig. 4.2, the basic idea is to
use

\[
p(s,t) = \begin{cases} D^{-1} & \text{if the centre of pixel } s \text{ falls in tube } t, \\ 0 & \text{otherwise.} \end{cases} \tag{4.1}
\]


We thus gloss over small variations in p’s due to the planar geometry, but note that
there is an important effect due to the third dimension which we shall discuss in
Section 4.1.4. The computational advantage of concentrating on the centres of pixels in
our disc discretisation is great. For D = 128 and D_1 = 256, we need only store the
locations of 8,192 nonzero p’s (compared with 27,378 reported for Kaufman’s
(1987) setup) and save considerably on computer time by addressing only those
terms corresponding to nonzero p’s in the EM update (2.2).
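
One way to picture approximation (4.1) is sketched below in Python. This is only an illustration under stated assumptions: the text does not say how "the centre of pixel s falls in tube t" is tested, so here lines through the centre are sampled over a grid of orientations and the pair of detectors hit by each line is recorded. The function names, the zero-based detector labels and the choice of 1,024 sampled orientations are all hypothetical.

```python
import math

def detector_index(px, py, num_detectors):
    """Index of the detector arc containing the boundary point (px, py)."""
    angle = math.atan2(py, px) % (2.0 * math.pi)
    return int(angle / (2.0 * math.pi) * num_detectors) % num_detectors

def tubes_through_point(x, y, num_detectors, num_angles=1024):
    """Tubes (i, j), i < j, hit by lines through (x, y) at sampled orientations."""
    tubes = set()
    c = x * x + y * y - 1.0  # negative for points inside the unit disc
    for k in range(num_angles):
        theta = (k + 0.5) * math.pi / num_angles
        ct, st = math.cos(theta), math.sin(theta)
        b = x * ct + y * st
        root = math.sqrt(b * b - c)
        # The line meets the unit circle at parameters u = -b +/- root.
        ends = [detector_index(x + u * ct, y + u * st, num_detectors)
                for u in (-b + root, -b - root)]
        i, j = min(ends), max(ends)
        if i != j:
            tubes.add((i, j))
    return tubes

def kernel_row(x, y, num_detectors):
    """Approximation (4.1): weight 1/D for every tube containing the centre."""
    return {t: 1.0 / num_detectors
            for t in tubes_through_point(x, y, num_detectors)}

row = kernel_row(0.0, 0.0, 128)
print(len(row))
```

Storing each row as a sparse mapping from tube to weight reflects the remark that only the locations of the nonzero p’s need be kept.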

In  practice,  real  PET  apparatus  involves  numerous  further  important  practical 
aspects  including,  for  example,  time-of-flight  considerations  (non-coincident  arrival  at 
detectors), attenuation problems ("soaking up" of X-ray photons) and scattering
(non-axial photons); see Vardi et al. (1985), Kearfott (1985) and Hoffman and Phelps
(1986). Some of these effects, such as time-of-flight and scattering, involve alterations
only to the p(s,t)’s, so our general methodology would carry over unchanged.
Nonlinear effects like attenuation, where the p(s,t)’s depend on the unknown image,
would require more substantial modifications.


4.1.4  Accounting  for  the  Third  Dimension 

Photon lines are emitted in directions distributed uniformly in 3-dimensional
space and the detectors have a finite depth, d, say; this has not yet been taken into
account. Consider a tube of length l, say, where l is large relative to d, and condition
on the emission being in the direction of that tube. Suppose the annihilation takes
place at a distance l_1 from the left hand detector at a height x and write l_2 = l - l_1 (take
x ≤ d/2 and l_1 ≤ l_2, without loss of generality); see the cross-sectional view of Fig. 4.4. It
is natural to assume that x is uniformly distributed on [0, d/2]; this reflects an
assumption that d is small enough for there to be negligible change in intensity over that
distance. The contribution to p(s,t) due to this third dimension is what we consider here,
namely, the probability that a particular emission yields a photon pair that hits both
detectors.

Suppose that ψ is the angle that the photon line makes with the plane of the
detectors. Since l >> d, only small ψ’s can occur, so that ψ ≈ tan ψ is approximately
uniformly distributed on its range of admissible values. Evaluating this range is not
difficult. First, if l^{-1} l_1 d ≤ x ≤ d/2, any line hitting the right hand detector automatically
also hits the left hand one; the range of appropriate ψ’s is thus approximately d/l_2.
Otherwise, 0 ≤ x ≤ l^{-1} l_1 d and the range, which is governed by the angles allowed by the
bottom edges of both detectors, is approximately xl/(l_1 l_2). Averaging over the
distribution of x yields an average range of ψ which corresponds to the required
probability. A simple calculation shows that, to the degree of approximation used above,
this probability is d/l.
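
The "simple calculation" can be checked numerically. The Python sketch below averages the two-case range of ψ given above over x uniform on [0, d/2], using a midpoint rule; the function names and the particular values of l, l_1 and d are assumptions chosen only for illustration.

```python
def psi_range(x, l1, l, d):
    """Approximate range of admissible angles for an emission at height x."""
    l2 = l - l1
    if x >= l1 * d / l:
        # Any line hitting the right hand detector also hits the left hand one.
        return d / l2
    # Otherwise the bottom edges of both detectors limit the range.
    return x * l / (l1 * l2)

def average_psi_range(l1, l, d, steps=20000):
    """Midpoint-rule average of psi_range over x uniform on [0, d/2]."""
    h = (d / 2.0) / steps
    return sum(psi_range((k + 0.5) * h, l1, l, d) for k in range(steps)) / steps

print(average_psi_range(0.3, 2.0, 0.05), 0.05 / 2.0)
print(average_psi_range(0.9, 2.0, 0.05), 0.05 / 2.0)
```

Both averages agree with d/l to within the accuracy of the integration, whatever the value of l_1, which illustrates why the resulting factor depends only on the length of the tube.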

Now,  this  quantity  does  not  depend  on  where  in  the  tube  the  annihilation  takes 
place, but only on the length l = l(t) of the tube. Thus, p(s,t) is modified by a
factor, inversely proportional to l(t), depending only on t, and not s. The effect on the
data is clear: a smaller proportion of emissions occurring towards the centre of the disc
will be detected than of those happening near the edge, with a consequent degradation
of reconstruction quality to be expected in the (important) central area. That this third
dimension effect remains important while an apparently similar effect in the planar
case (imagine Fig. 4.4 as the view looking down on a tube in Fig. 4.2) does not,
is due to short tubes in the plane also becoming thin tubes (d decreases with l), but
retaining their depth in the third dimension.

The  real  importance  of  the  third  dimension  lies  in  the  fact  that  the  3-dimensional 
problem does not tend, in the limit as d → 0, to the 2-dimensional one. To see this,
note that any d > 0 results, after proper normalisation, in an identical set of p(s,t)’s;
these include l(t), the 2-dimensional ones do not. Since the real PET problem is
3-dimensional, our approximating to that case is much preferable to approximating the
planar situation only. Since the change to p(s,t) depends only on t, the extra
computational burden imposed by taking account of the third dimension is virtually nil.

4.2  EM  Reconstruction

We  are  now  in  a  position  to  apply  the  EM  algorithm  for  ML  estimation  to  the 
PET  problem,  exactly  as  described  in  Section  2.  Shepp  and  Vardi  (1982)  were  the 
first  to  do  so;  Vardi  et  al.  (1985)  and  Kaufman  (1987)  follow  up  this  work  (see  also 
Lange  and  Carson,  1984).  The  uniform  initialisation  /  early  termination  version  of  EM 
which is actually employed is widely regarded as being among the best PET
reconstruction procedures currently available; see, for example, Shepp et al. (1984), Mintun
et  al.  (1985)  and  Vardi  et  al.  (1985),  the  last  named  outlining  several  competing 
reconstruction  methods.  Most  commercial  PET  installations  persist  in  using  other 
techniques  (especially  "convolution  back  projection",  see  Shepp  and  Kruskal,  1978) 
because  of  the  computational  advantage  such  approaches  afford  (Kaufman,  1987). 

A  dataset  arising  from  the  phantom  of  Fig.  4.3  was  simulated;  all  reconstruction 
attempts  to  be  portrayed  in  succeeding  figures  are  based  on  these  data.  In  line  with 
many other studies, a total number of emissions, N, of 10^6 was chosen (this is,
however, rather fewer than the number employed by Vardi et al., 1985, and Kaufman,
1987). Data simulation was again performed by mimicking the physical process:
obtain points from the Poisson process with the intensity displayed in Fig. 4.3 by the
obvious acceptance / rejection method, obtain randomly oriented lines through these
points by choosing uniformly distributed angles and finally perform a further
acceptance / rejection step with acceptance probabilities inversely proportional to the
lengths of these lines to take the third dimension into account.
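
The three simulation steps can be sketched in Python as follows. This is a minimal illustration, not the code used for the experiments: the intensity function, its upper bound `max_intensity`, the detector depth `depth`, the capping of the acceptance probability at 1, and the treatment of lines whose two ends fall in the same detector as undetected are all assumptions introduced here.

```python
import math
import random

def simulate_detection(rng, intensity, max_intensity, num_detectors, depth):
    """Simulate one emission; return the tube (i, j), i < j, or None if undetected."""
    # Step 1: acceptance / rejection sampling of a point from the emission density.
    while True:
        x, y = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        if x * x + y * y < 1.0 and rng.random() * max_intensity < intensity(x, y):
            break
    # Step 2: a line through the point with uniformly distributed orientation.
    theta = rng.uniform(0.0, math.pi)
    ct, st = math.cos(theta), math.sin(theta)
    b = x * ct + y * st
    root = math.sqrt(b * b - (x * x + y * y - 1.0))
    length = 2.0 * root  # chord length across the unit disc
    # Step 3: accept with probability inversely proportional to the line length
    # (assumed form min(1, depth / length)).
    if rng.random() >= min(1.0, depth / length):
        return None
    ends = []
    for u in (-b + root, -b - root):
        angle = math.atan2(y + u * st, x + u * ct) % (2.0 * math.pi)
        ends.append(int(angle / (2.0 * math.pi) * num_detectors) % num_detectors)
    i, j = min(ends), max(ends)
    return (i, j) if i != j else None

rng = random.Random(0)
hits = [simulate_detection(rng, lambda x, y: 1.0, 1.0, 128, 0.05) for _ in range(1000)]
print(sum(h is not None for h in hits), "detections out of", len(hits))
```

Accumulating the returned tubes into counts would give simulated data of the kind described; the flat intensity used in the usage line is a placeholder for the phantom.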

A  reconstruction  of  the  phantom  obtained  by  allowing  the  EM  algorithm  to  run 
for  some  considerable  time  —  here,  200  iterations  —  is  shown  in  Fig.  4.5.  The  result 
is typical of the unacceptability of "ML reconstructions" in this context. The image
obtained is itself very noisy: putative flat areas are estimated to be extremely rough.
As  well  as  the  lack  of  aesthetic  appeal,  the  effect  of  this  is  that  only  the  very  strongest 
features  —  here,  the  large  circle  and  ellipse,  both  with  very  different  intensity  from  the 
background  —  survive  for  inspection;  this  is  clearly  unsatisfactory.  The  speckled 
nature  of  Fig.  4.5  reflects  the  roughness  of  the  reconstructed  surface  in  plan  view; 
neighbouring pixels are estimated to have widely differing intensities. It is important
to note that the EM algorithm has not yet converged and the roughness described here
gets worse if we allow the algorithm to run further. This is because ML is trying to
suggest a "spiky" solution to the problem, much as in Figs 3.1 and 3.2 for the
univariate analogue, this effect being mitigated here by the smoothing due to the
discretisation of the disc. The same grey scaling is used on all reconstructions. The great
variability in Fig. 4.5 implies that in the darker areas, some estimated pixel intensities lie
above the highest grey level and have been redefined to be black; some of the
speckled nature of the picture, especially on the largest circle, has thus been concealed.

Of  course,  in  practice,  application  of  the  EM  algorithm  is  not  allowed  to  reach  a 
state  like  that  of  Fig.  4.5.  Rather,  the  iterations  are  terminated  early:  Fig.  4.6  displays 
the  reconstruction  obtained  by  stopping  after  just  24  EM  iterations.  Calling  this 
(erroneously)  "the  ML  reconstruction"  accounts  for  the  good  performance  attributed  to 
the method: in Fig. 4.6, large "objects" are well reconstructed and roughness,
compared with Fig. 4.5, is considerably reduced. (Small scale features present in the
phantom are hinted at, if not reproduced convincingly.) Veklerov and Llacer (1987) propose
a data-dependent rule for selecting the point at which early termination of the EM
algorithm should occur. The use of a constant initial configuration is important here;
it is a smoothing influence which persists through the early stages of the EM
algorithm. Roughly speaking, early iterations quickly make manifest approximate shapes
and intensities of objects, while the later iterations are responsible for roughening the
image. The uniform starting point is the ultimate in smooth images in the sense
appropriate to PET. So, rather than the common approach of smoothing a rough
image towards such a smooth one, the EM iterations are used to "roughen" away from
the ultrasmooth.

Different  choices  of  initial  estimate  result  in  different  EM  reconstructions;  a  vivid 
illustration  of  how  properties  of  initial  reconstruction  can  persist  to  appear  in  iterated 
reconstructions is given in Fig. 5 of Kaufman (1987). Extensive recent work on
accelerating convergence of the EM algorithm (Lewitt and Muehllehner, 1986,
Kaufman, 1987, Lange, Bahn and Little, 1987) seems gratuitous since, as we have argued,
the  ML  optimisation  is  inappropriate  to  the  problem  at  hand. 

All  our  reconstructions  incorporate  the  third  dimension  effect  described  in  Section 
4.1.4.  These  turn  out  to  be  slightly  smoother  than  comparable  reconstructions  of  the 
purely  two-dimensional  version;  the  length  bias  has  the  effect  of  making  the  problem 
less  ill-posed.  As  anticipated  in  Section  4.1.4,  there  is  a  slight  deterioration  in  quality 
of  reconstruction  towards  the  centre  of  the  image;  perhaps  it  would  be  more  realistic  to 
suppose  the  area  of  interest  filled  a  smaller  portion  of  the  tomograph  disc,  whence 
such an effect might become more important. Note also that we might expect the
incorporation  of  more  physical  considerations  into  the  p(s,t)’ s  to  result  in  less  smooth 
reconstructions  than  here,  since  most  would  have  a  smoothing  effect  on  the  kernel  and 
a  consequent  worsening  of  the  ill-posed  nature  of  the  problem. 

4.3  Smoothed  EM  Reconstruction


4.3.1  Local  Smoothing 

We  utilise  perhaps  the  most  natural  (and  common)  approach  to  smoothing  values 
on  a  spatial  grid:  replace  the  value  at  each  pixel  by  some  function  of  that  value  and 
those  of  its  nearest  neighbours.  Examples  of  useful  smoothing  functions  follow  in 
Section  4.3.2.  A  little  care  needs  to  be  taken  over  the  definition  of  neighbours  in  our 
circular  discretisation  scheme.  For  a  rectangular  discretisation,  Besag  (1986,  p.262), 
for example, identifies nearest neighbours of a pixel in a natural way: first-order
neighbours are those pixels adjacent vertically and laterally to the one of interest, while a
second-order neighbourhood additionally includes diagonal adjacencies. The effect of
a finite window is to modify these definitions (in an obvious way) for boundary and
corner pixels. It is not difficult to translate these notions to the circular grid although,
because of the seaming, we need deal with 8 (rather than 3) different pixel types. First
and second order neighbours of pixels of each type are identified in Fig. 4.7 (using
D_1 = 64 for clarity). These definitions retain the desirable property of symmetry of
neighbour pairs: if s_1 is a neighbour of s_2, s_2 is a neighbour of s_1. Following Besag
(1986), we view the second order scheme as the most useful one (and use it
throughout). A further argument here for going beyond first-order neighbourhoods is
that alternate pixels on the outside of a seam have different types of adjacency on the
inner edge; this leads to an undesirable "castellation" effect on reconstructions using
first-order neighbours only.

Unlike the basic EM algorithm, boundary and seam effects mean that EMS
algorithms do not automatically scale so that Σ_s λ^(i)(s) = N, for all i ≥ 1. Operationally, we
rescale smoothed images to have this property; although this has no effect on
successive reconstructions, it is useful in making successive values of the (log) likelihood
comparable.
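
The structure of one EMS iteration with rescaling can be sketched in Python as below. The EM update (2.2) is not reproduced in this section, so the sketch assumes the standard multiplicative EM update for Poisson data with each row of the kernel summing to one; the function names and the sparse dictionary representation of p(s,t) are also assumptions.

```python
def em_step(lam, counts, p):
    """One multiplicative EM update (assumed form of (2.2)).

    lam[s] is the current image, counts[t] the tube counts n(t), and
    p[s] a dict {t: p(s,t)} of nonzero kernel values with sum_t p(s,t) = 1.
    """
    # Expected tube counts under the current image.
    mu = [0.0] * len(counts)
    for s, row in enumerate(p):
        for t, pst in row.items():
            mu[t] += lam[s] * pst
    # Redistribute the observed counts back to the pixels.
    return [lam[s] * sum(counts[t] * pst / mu[t]
                         for t, pst in row.items() if mu[t] > 0.0)
            for s, row in enumerate(p)]

def ems_step(lam, counts, p, smooth):
    """EM update, then a smoothing step, then rescaling to total N."""
    smoothed = smooth(em_step(lam, counts, p))
    scale = sum(counts) / sum(smoothed)
    return [scale * v for v in smoothed]

p = [{0: 1.0}, {0: 0.5, 1: 0.5}, {1: 1.0}]
counts = [30, 70]
lam = [100.0 / 3.0] * 3
print(ems_step(lam, counts, p, lambda v: [0.9 * x for x in v]))
```

Because this multiplicative update depends on the current image only through the ratios λ(s)/μ(t), multiplying the image by a constant leaves the next EM update unchanged, which is consistent with the remark that rescaling has no effect on successive reconstructions.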

4.3.2  Smoothed  Reconstructions

In  this  section,  we  present  some  examples  of  applying  versions  of  the  smoothed 
EM  algorithm  to  reconstructing  the  phantom  of  Fig.  4.3.  Because  the  phantom,  as 
well  as  images  likely  to  occur  in  practice,  is  not  everywhere  smooth  but  contains 
discontinuities,  we  have  experimented  with  both  simple  linear  smoothers  and  with 
local non-linear ones. Smoothers that purport to preserve edges are necessarily
non-linear in the values at the pixel of interest and its neighbours; Scher et al. (1980) and
Chin and Yeh (1983) describe a number of methods that have been proposed in the
literature for use in cleaning up noisy images containing discontinuities. However, the
performance of non-linear smoothers within EMS has been disappointing. We report
reconstructions based on just two of these smoothing schemes out of the several we
have tried. In Fig. 4.8, we exhibit the result of using the EMS algorithm with a local
median smoother, i.e. replacing a pixel value by the median of it and its neighbours’
values.  In  Fig.  4.9,  a  slightly  more  sophisticated  non-linear  smoother  —  the  best  we 
have  used  in  this  context  —  was  deployed.  This  is  the  mean  of  the  central  pixel  value 
and  of  the  two  neighbouring  values  closest  to  the  central  one;  in  this  way,  we  try  to 
average  only  over  pixels  on  "the  right  side"  of  an  edge  (this  is  a  special  case  of  KAVE 
of  Chin  and  Yeh,  1983).  Neither  of  these,  nor  any  others  that  we  have  investigated, 
yields  a  good  reconstruction.  As  well  as  eradication  of  the  smaller  features  of  the 
phantom,  significant  distortions  are  introduced  as  artefacts  of  the  methods  used.  Both 
Figs  4.8  and  4.9  are  pictures  produced  after  200  EMS  iterations.  It  is  important  to 
note that neither of these non-linear methods converged.
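
The two non-linear smoothers described above can be sketched in Python as follows. The representation of the grid as a list of neighbour index lists, and the function names, are assumptions for illustration; the neighbour structure of Fig. 4.7 is not reproduced here.

```python
from statistics import median

def median_smooth(values, neighbours):
    """Replace each pixel value by the median of it and its neighbours' values."""
    return [median([values[s]] + [values[r] for r in neighbours[s]])
            for s in range(len(values))]

def nearest_two_mean_smooth(values, neighbours):
    """Mean of the central value and the two neighbouring values closest to it."""
    out = []
    for s, v in enumerate(values):
        closest = sorted((values[r] for r in neighbours[s]),
                         key=lambda w: abs(w - v))[:2]
        out.append((v + sum(closest)) / (1 + len(closest)))
    return out

# A one-dimensional strip with a sharp edge; neighbours are pixels within two steps.
values = [1, 1, 1, 9, 9, 9]
neighbours = [[1, 2], [0, 2, 3], [0, 1, 3, 4], [1, 2, 4, 5], [2, 3, 5], [3, 4]]
print(median_smooth(values, neighbours))
print(nearest_two_mean_smooth(values, neighbours))
```

On this toy strip both smoothers leave the step edge intact, which is the behaviour they are designed for; the text reports that within EMS their performance on the phantom was nevertheless disappointing.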

Returning  to  linear  local  smoothers  —  and  thus  relaxing  our  concern  for  trying  to 
avoid  blurring  feature  edges  —  we  get  better  results.  A  simple  scheme,  which  works 
well, is this: take a weighted average of the form weight 1 for the central value and
equal weights W^{-1}, say, for each neighbouring value, normalised appropriately (other
linear smoothing possibilities are in Russ and Russ, 1984). This is closely related to
the  way  we  smoothed  in  the  stereology  context.  It  turns  out  that  we  need  only  a  small 
amount  of  smoothing  (W  large)  for  good  effect.  Reconstructions  for  the  ongoing 
example  are  given  using  W  =  200,  50  and  10  in  Figs  4.10,  4.11  and  4.12,  respectively. 
The first of these reflects the effect of (slightly) undersmoothing: background
roughness remains too high, although objects are fairly clearly visible. The last is a little
oversmoothed: better background but loss of resolution in object reconstruction. The
choice W = 50 in Fig. 4.11 seems to be about as good a compromise as can be
obtained by this method. We have not considered automatic choice of smoothing
parameter, but are encouraged by the fact that choosing the "best" W might not be critical:
reconstructions (not shown) using W between say 100 and 25 are not substantially
different from that of Fig. 4.11. Note that in virtually all of our reconstructions a minor
effect due to the seam in our discretisation with radius one half that of the disc is
faintly visible. In particular, this artefact has had a slightly detrimental effect on the
quality of reconstruction of the pair of small ellipses towards the bottom of the
phantom which lie near to this seam.

Quantifying reconstruction quality in image analysis is not easy. We briefly report
L_1 discrepancies between phantom and reconstructions; this quantity is

\[
B = \sum_{s=1}^{S} a(s)\, \bigl| \hat{\Lambda}(s) - \Lambda(s) \bigr|
\]

where Λ and Λ̂ are grey scale values corresponding to λ and λ̂ respectively. Now,
B = 6.191 for our W = 50 reconstruction, although smaller values of B are achieved for
smoother  pictures:  B  =  5.728  for  W  =  25  is  the  best  achieved.  Our  visual  preferences 
are better reflected in other L_1 values, though: Fig. 4.5 yields the very large value
B  =  20.844,  Fig.  4.9  is  just  preferable  (B  =  9.825)  to  Fig.  4.8  (B  =  10.222)  and  is 
much  preferable  to  other  non-linear  EMS  solutions,  and  the  reconstruction  of  Fig.  4.6 
after  24  EM  iterations  is  quite  good  with  B  =  6.910  (this  is  comparable  with  EMS 
reconstructions with W between 75 and 100). It is noticeable that B displays a
preference for oversmoothed images but is otherwise satisfactory. In any case, it is widely
recognised that this type of measure does not really give a good reflection of the
human observer’s sense of image fidelity especially when, as here, the true image
contains features with distinct edges. Indeed, the provision of image metrics that properly
reflect visual perception remains a difficult question: see Baddeley (1987), for
example.
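
For concreteness, the area-weighted L_1 discrepancy B can be computed as in the following Python sketch; the function name and the list-based inputs are assumptions for illustration.

```python
def l1_discrepancy(areas, estimate, truth):
    """Area-weighted L_1 discrepancy between two images on the same pixel grid."""
    return sum(a * abs(e - t) for a, e, t in zip(areas, estimate, truth))

# Two pixels with areas 1 and 2: B = 1 * |3 - 4| + 2 * |5 - 3|.
print(l1_discrepancy([1.0, 2.0], [3.0, 5.0], [4.0, 3.0]))
```

Here `estimate` and `truth` would hold the grey scale values of the reconstruction and the phantom, and `areas` the pixel areas a(s).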

Finally,  the  EMS  algorithm  using  local  linear  smoothing  has  always  converged  in 
a reasonable number of iterations. Indeed, using a convergence criterion essentially
corresponding to that in Section 3.2.3, the numbers of iterations required for
convergence of EMS with W = 200, 50 and 10 were 62, 43 and 32, respectively. Moreover,
simulation  experience  suggests  that  the  local  linear  EMS  reconstruction  is  unique. 

4.3.3  Other  Smoothed  EM  Methods  in  Emission  Tomography

To  the  best  of  our  knowledge,  this  is  the  first  time  our  simple  EMS  algorithm  has 
been  proposed.  There  are,  however,  some  other  suggestions  for  incorporating  smooth¬ 
ing  into  the  EM  algorithm  in  PET  and/or  SPECT  in  the  recent  literature.  We  have 
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already  mentioned  a  penalised  likelihood  approach  —  an  EMp  algorithm  —  in  Section 
2.  In  the  PET  context,  the  Mp  step  is  essentially  the  same  problem  as  that  arising  in 
image  processing  problems  which  are  approached  by  maximum  a  posteriori  estimation 
(see  Geman  and  Geman,  1984,  and  Besag,  1986).  Even  when  an  appropriate  penalty 
function  or  prior  distribution  has  been  decided  upon  —  locally  dependent  Markov  ran¬ 
dom  fields  form  a  class  of  priors  capable  of  quantifying  notions  of  local  smoothness 
(Besag,  1986)  —  the  computational  problem  of  locating  the  global  maximum  of  the 
penalized likelihood is immense and not yet satisfactorily solved (Greig et al., 1986).
Obtaining  a  local  maximum  at  the  Mp  stage  is  more  reasonable.  A  successful  method 
for  finding  a  "good"  local  maximum  in  image  processing  is  Besag’s  (1986)  iterated 
conditional  modes  (ICM)  algorithm.  Roughly  speaking,  ICM  is  not  all  that  different 
from  our  simple  local  smoothing:  it  performs  a  few  iterations  of  a  sequential  local 
smooth  (i.e.  "current"  pixel  values  include  those  already  updated,  not  just  the  originals) 
using  a  local  smoother  based  on  maximising  a  penalized  marginal  likelihood.  We 
would  not  be  surprised  to  find  that  the  ICM  approach  yields  good  reconstructions;  we 
wonder,  though,  if  even  its  level  of  sophistication  will  ultimately  prove  to  be 
worthwhile.  Indeed,  Geman  and  McClure  (1985,  1987),  considering  the  application  of 
such  methods  in  the  context  of  SPECT,  decided  to  fall  short  of  a  full  implementation 
of  such  an  Mp  algorithm.  Rather,  they  obtained  a  reconstruction  by  some  other  method 
to  act  as  initial  estimate  and  then  applied  a  single  Mp  step  of  the  above  sort.  Perhaps  a 
better  perspective  on  Geman  and  McClure's  approach  is  as  the  application  of  popular 
image  processing  techniques  to  cleaning  up  reconstructions  obtained  in  other  ways. 
Note  too  that  Geman  and  McClure  (1987)  consider  posterior  mean  reconstructions 
(their  penalised  likelihoods  are  posterior  distributions)  as  alternatives  to  posterior 
modes.  Less  appealing  to  the  current  authors  are  other  EMp  approaches  utilising 
pixel-by-pixel priors designed to encourage smoothing towards prespecified, or
estimated,  images.  Examples  are  given  by  Hart  and  Liang  (1987),  Lange  et  al.  (1987) 
and  Levitan  and  Herman  (1987).  Other  regularisation  procedures,  based,  we  believe, 
on  inappropriate  roughness  penalty  functions,  are  considered  by  Girard  (1987)  and 
Miller  and  Snyder  (1987). 

A  rather  different  approach  to  smoothed  EM  algorithms  for  PET  is  taken  by 
Snyder  and  Miller  (1985)  (also  Miller,  Snyder  and  Moore,  1986,  and  Snyder  et  al., 
1987).  These  authors  force  their  emission  density  estimate  to  have  kernel  convolution 
form, i.e.

$$\lambda(s) = \int \Theta(s,z)\,\theta(z)\,dz, \qquad s = 1,\dots,S,$$

for $\Theta$ a known kernel, where $\theta$ is estimated by ML; this is a kernel convolution sieve
(Geman and Hwang, 1982). This is identical with replacing the point-spread function
by a kernel-smoothed version of it,

$$p_\Theta(s,t) = \sum_r \Theta(s,r)\,p(r,t),$$

say, and proceeding by the EM algorithm. As a last step, this ML solution is smoothed
once by application of $\Theta$. This approach, while certainly yielding smooth images,
requires the (spiky) ML solution to what is an even more ill-posed problem (caused by
the smoothing effect of $\Theta$) and so it retains and perhaps even exacerbates all the
numerical convergence problems of obtaining true ML reconstructions by the EM
method.

4.3.4  Closing  Remarks  on  the  PET  Application 

A  first  striking  feature  of  the  reconstructions  shown  in  this  paper  is  the  similarity 
between  that  obtained  by  the  uniform  start  /  early  termination  modification  of  basic 
EM,  in  Fig.  4.6,  and  the  "best"  weighted  local  mean  EMS  reconstruction  shown  in  Fig. 
4.11.  We  have  certainly  shown  that  nothing  need  be  lost  in  terms  of  reconstruction 
quality by the introduction of our explicit smoothing procedure and would argue that
the  latter  image  is  indeed  a  slight  improvement  over  the  former.  Moreover,  the  EMS 
formulation  offers  prospects  of  further  improvement:  other  local  smoothing  schemes 
can  be  fitted  into  the  same  framework  and  might  work  better,  while  the  benefits  of  the 
provision  of  an  apparently  uniquely  convergent  algorithm  include  scope  for  further 
computational  improvement  such  as  accelerating  that  convergence. 

5.  SOME  THEORETICAL  BACKGROUND 

The clear empirical success of the EMS algorithm immediately raises several
theoretical questions. It has been observed in practice that the EMS algorithm employing
linear smoothers converges relatively quickly and that its limit point is apparently
unique. Obviously it would be of interest to prove these properties rigorously. Unfortunately,
we have not been able to do so, but in this section we provide a heuristic discussion
that relates the EMS procedure to an EMp approach where the likelihood is
penalised by a term that is quadratic in the vector of square roots of the intensities.
This  relationship  gives  some  insight  into  the  good  properties  of  EMS  and  it  is  our 
hope  that  it  will  be  a  useful  starting  point  for  future  theoretical  work. 

5.1  A  Lemma 

The  first  step  in  our  development  is  a  simple  algebraic  lemma. 

Lemma. Suppose that W is a diagonal matrix of weights and that S is a matrix for
which $S_{rs} \ge 0$ for all r and s and $\sum_s S_{rs} = 1$ for all r. Suppose that for some $\delta > 0$

$$|w_r^{-1} w_s - 1| \le \delta$$

for all (r,s) such that $S_{rs} \ne 0$; here, the w's are the diagonal elements of W. Define a
matrix T by $T = W^{-1} S W$. Then for any vector x

$$|(Tx - Sx)_r| \le \delta \sup_u |x_u|. \qquad (5.1)$$

Proof. $|(Tx - Sx)_r| = |\sum_s S_{rs}(w_r^{-1} w_s - 1)x_s| \le \delta \sup_u |x_u| \sum_s S_{rs} = \delta \sup_u |x_u|$. □

The  implication  of  the  lemma  in  the  current  context  is  as  follows.  Suppose  that  x  is 
indexed  by  our  reconstruction  bins  and  that  S  is  a  local  smoothing  operator  so  that 
$S_{rs} = 0$ unless r and s are neighbouring bins. Suppose that W is an array of weights
that vary continuously over the space Y, that is, $w_r \approx w_s$ whenever r and s are neighbours;
then $\delta$ can be chosen to be small. The operator T corresponds to weighting an
intensity by the w weights, smoothing by the operator S and then unweighting. The
lemma therefore quantifies the intuitive notion that S and T will have approximately
equal effects.
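
The bound (5.1) is easy to check numerically. The following is a minimal sketch, not part of the original study: it assumes an illustrative three-point local mean smoother on a line of bins (the helper `local_mean_smoother` is hypothetical) and smoothly varying weights, and it compares the two sides of the inequality.

```python
import numpy as np

def local_mean_smoother(n):
    """Illustrative row-stochastic smoother: 3-point moving average on n bins."""
    S = np.zeros((n, n))
    for r in range(n):
        nb = [s for s in (r - 1, r, r + 1) if 0 <= s < n]
        S[r, nb] = 1.0 / len(nb)          # rows sum to one, entries non-negative
    return S

def lemma_check(S, w, x):
    """Return (lhs, rhs) of (5.1): max_r |(Tx - Sx)_r| and delta * sup_u |x_u|."""
    W = np.diag(w)
    T = np.linalg.inv(W) @ S @ W          # weight, smooth, then unweight
    ratio = w[None, :] / w[:, None]       # ratio[r, s] = w_s / w_r
    delta = np.max(np.abs(ratio - 1.0)[S != 0])   # smallest delta valid for this S, w
    lhs = np.max(np.abs(T @ x - S @ x))
    rhs = delta * np.max(np.abs(x))
    return lhs, rhs

if __name__ == "__main__":
    n = 50
    S = local_mean_smoother(n)
    w = 1.0 + 0.5 * np.sin(np.linspace(0.0, 3.0, n))   # slowly varying weights
    x = np.cos(np.arange(n))
    print(lemma_check(S, w, x))
```

The smaller the relative change in the weights between neighbouring bins, the smaller $\delta$ becomes and the closer T is to S.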

5.2 A Relationship Between Local Smoothing and Maximum Penalised Likelihood for Poisson Random Variables

Write $\varphi$ for the vector of $\varphi_s$'s where $\varphi_s = \lambda(s)/a(s)$ as in Section 4.1.3 and
define $x_s = \varphi_s^{1/2}$ for all s with $x = (x_1, \dots, x_S)$. Also write $\Psi$ as the diagonal matrix
with diagonal elements $\psi_s = q(s)a(s)$. Let S be a smoothing matrix all of whose
eigenvalues lie in (0,1] and define $R = \Psi^{1/2}(S^{-1} - I)\Psi^{1/2}$. Suppose that observations m(s)
of independent Poisson($\varphi_s\psi_s$) random variables are available and let $l_p(x)$ be the log
likelihood penalised by $x^T R x$, i.e.

$$l_p(x) = \sum_s m(s)\log(x_s^2\psi_s) - \sum_s x_s^2\psi_s - x^T R x.$$

To see that $x^T R x$ has the effect of being a roughness penalty, note that the eigenvectors
corresponding to large eigenvalues of R will be those corresponding to small
eigenvalues of S, and so will consist, loosely speaking, of high frequency oscillations.
The following theorem demonstrates a connection between the penalised ML estimate
of x, and the estimate obtained by direct smoothing of the ML estimate of $\varphi$ and by
taking square roots.

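Define $\hat{x}$ to be the maximiser of $l_p(x)$ in $\{x_s > 0\}$. Set $W = \Psi^{1/2}\hat{X}^{-1}$, where
$\hat{X} = \mathrm{diag}(\hat{x})$. The ML estimate of $\varphi$ is $\Psi^{-1}m$. Define $T = W^{-1}SW$ as in Section 5.1
and let $\tilde{x} = (\tilde{x}_1, \dots, \tilde{x}_S)$ where $\tilde{x}_s = \{(T\Psi^{-1}m)_s\}^{1/2}$ for each s.

Theorem. With the above definitions, $\hat{x} = \tilde{x}$.

Proof. Write

$$l_p(x) = -x^T(\Psi + R)x + 2\sum_s m(s)\log x_s + \sum_s m(s)\log\psi_s.$$

Hence the Hessian matrix of $l_p$ is $-2(\Psi + R + \mathrm{diag}\{m(s)/x_s^2\})$, which is strictly negative
definite in $\{x_s > 0\}$, and so $\hat{x}$ will be uniquely defined by $\nabla l_p(\hat{x}) = 0$. This is true
if and only if

$$\hat{x}_s\{(\Psi + R)\hat{x}\}_s = m(s) \quad \text{for each } s.$$

It is easy to see that $\Psi + R = \Psi^{1/2}S^{-1}\Psi^{1/2}$ and that, if $\hat\varphi$ is the vector of $\hat{x}_s^2$'s,
$\hat\varphi = \Psi^{1/2}W^{-1}\hat{x}$, i.e. $\hat{x} = \Psi^{-1/2}W\hat\varphi$. Therefore, the vector with components
$\hat{x}_s\{(\Psi + R)\hat{x}\}_s$ is equal to

$$\hat{X}(\Psi + R)\hat{x} = (W^{-1}\Psi^{1/2})(\Psi^{1/2}S^{-1}\Psi^{1/2})(\Psi^{-1/2}W\hat\varphi) = \Psi W^{-1}S^{-1}W\hat\varphi = \Psi T^{-1}\hat\varphi.$$

Thus, $m = \Psi T^{-1}\hat\varphi$ so that $\hat\varphi = T\Psi^{-1}m$ and therefore $\hat{x}_s = \{(T\Psi^{-1}m)_s\}^{1/2}$ for all s. □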
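
The identity in the theorem can be checked on a small example. The sketch below is illustrative only and rests on assumptions that are not in the text: the smoother is taken to be $S = (I + \alpha L)^{-1}$ with L a path-graph Laplacian (symmetric, eigenvalues in (0,1], rows summing to one), all counts m(s) are taken to be positive, and the maximiser is found by damped Newton steps. The function names are hypothetical.

```python
import numpy as np

def path_smoother(n, alpha=1.0):
    """Illustrative smoother S = (I + alpha*L)^{-1}, L the path-graph Laplacian."""
    L = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    L[0, 0] = L[-1, -1] = 1.0                      # end bins have one neighbour
    return np.linalg.inv(np.eye(n) + alpha * L)

def penalised_ml(m, psi, S, max_iter=200, tol=1e-10):
    """Maximise l_p(x) over x > 0 by damped Newton steps (assumes all m > 0)."""
    rt = np.diag(np.sqrt(psi))
    A = rt @ np.linalg.inv(S) @ rt                 # Psi + R = Psi^{1/2} S^{-1} Psi^{1/2}
    lp = lambda x: -x @ A @ x + 2.0 * np.sum(m * np.log(x))   # constant term dropped
    x = np.sqrt(m / psi)                           # start at the unsmoothed ML estimate
    for _ in range(max_iter):
        grad = -2.0 * A @ x + 2.0 * m / x
        if np.max(np.abs(grad)) < tol:
            break
        hess = -2.0 * A - 2.0 * np.diag(m / x**2)
        step = np.linalg.solve(hess, -grad)        # Newton direction (an ascent direction)
        lp_x, t = lp(x), 1.0
        # halve the step until it stays positive and does not decrease l_p
        while t > 1e-12 and (np.any(x + t * step <= 0)
                             or lp(x + t * step) < lp_x - 1e-12 * abs(lp_x)):
            t *= 0.5
        x = x + t * step
    return x

def theorem_sides(m, psi, S):
    """Return (x_hat**2, T Psi^{-1} m) with W = Psi^{1/2} X_hat^{-1}, T = W^{-1} S W."""
    x_hat = penalised_ml(m, psi, S)
    W = np.diag(np.sqrt(psi) / x_hat)
    T = np.linalg.inv(W) @ S @ W
    return x_hat**2, T @ (m / psi)

if __name__ == "__main__":
    m = np.array([5.0, 9.0, 14.0, 8.0, 3.0, 6.0])
    psi = np.linspace(1.0, 2.0, 6)
    print(theorem_sides(m, psi, path_smoother(6, alpha=0.7)))
```

The two returned vectors should agree to numerical precision, which is the statement $\hat{x}_s^2 = (T\Psi^{-1}m)_s$.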

Of course, the smoothing matrix T depends, through the weights $w_s$, on $\hat{x}$ and so
the expression of $\hat\varphi$ as a smoothed version of $\Psi^{-1}m$ is not immediately of practical
use. However, a heuristic argument based on the lemma relates the smoothed ML estimate
$\varphi^* = S\Psi^{-1}m$ to $\hat\varphi$ as follows. Since the penalty $x^T R x$ can be expected to penalise
for roughness in x, the penalised ML estimates $\{\hat{x}_s\}$ will vary continuously. Provided
the $\psi_s$'s also vary continuously, so will the weights $\{w_s\}$ and hence, by the
lemma, the effects of smoothing by the operators S and T will be almost identical.
Thus $\varphi^* = S\Psi^{-1}m \approx T\Psi^{-1}m = \hat\varphi$. Note that in the PET context, the $\psi_s$'s do not vary
continuously across the seams in our discretisation of Y but this does not appear in
practice to have an important effect. In the stereology example, there is no such
discontinuity.


5.3 Smoothed EM and Penalised EM

Return now to the EM algorithm and consider the construction of an EMp algorithm
to maximise $l(n \mid \lambda)$ penalised by a term $x^T R x$ as above. Recall that, at each
iteration, in the notation of Section 2, $m(s) = \sum_t z(s,t)$. If the Mp step is then approximated
by finding the smoothed ML estimate $\lambda^*(s) = \varphi^*_s a(s)$, where $\varphi^* = S\Psi^{-1}m$,
then the effect is precisely an iteration of the EMS algorithm using the smoothing
operator S. Thus, each EMS iteration corresponds approximately to an iteration of the
EMp algorithm with the penalty on the square roots of the intensities; this is the point
we aimed to make.
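
To make the correspondence concrete, one EMS iteration can be sketched as an EM update followed by a smooth. This is a minimal sketch under stated assumptions rather than the implementation used for the figures: the E-step is written in the standard form for the Poisson emission model of Vardi et al. (1985), `p[s, t]` is a hypothetical array holding the detection probabilities p(s,t), and the bin factors a(s) default to one.

```python
import numpy as np

def ems_step(lam, n, p, S, a=None):
    """One EMS iteration: E-step, unsmoothed M-step, then smoothing by S.

    lam : current intensities lambda(s), shape (S_bins,)
    n   : observed tube counts n_t, shape (T,)
    p   : detection probabilities p(s, t), shape (S_bins, T)
    S   : smoothing matrix, shape (S_bins, S_bins)
    a   : bin factors a(s); assumed to be 1 if not supplied
    """
    if a is None:
        a = np.ones_like(lam)
    q = p.sum(axis=1)                 # q(s) = sum_t p(s, t)
    mu = lam @ p                      # expected count in each tube
    m = lam * (p @ (n / mu))          # E-step: expected emissions m(s) in each bin
    phi_ml = m / (q * a)              # Psi^{-1} m, the unsmoothed ML estimate of phi
    phi_star = S @ phi_ml             # smoothing step: phi* = S Psi^{-1} m
    return phi_star * a               # lambda*(s) = phi*_s a(s)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p = rng.random((5, 8))
    p /= 1.25 * p.sum(axis=1, keepdims=True)       # toy probabilities with q(s) = 0.8
    n = np.array([3.0, 0.0, 7.0, 2.0, 5.0, 1.0, 4.0, 6.0])
    lam = np.ones(5)
    for _ in range(20):
        lam = ems_step(lam, n, p, np.eye(5))       # S = I gives plain EM
    print(lam)
```

With S equal to the identity the step reduces to the ordinary EM update, so the same routine can be used to compare EM and EMS on the same data.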

This  heuristic  equivalence  may  account  for  the  rapid  convergence  of  the  EMS 
algorithm;  see  the  remarks  of  Dempster  et  al.  (1977)  about  the  EMp  algorithm.  We 
have  been  unable  to  prove  that  the  penalised  likelihood  has  a  unique  maximum  but  our 
empirical  experience  suggests  that  this  is  so.  Certainly  it  will  be  the  case  in  general 
that at any maximum of the penalised likelihood the Hessian matrix is negative definite
so the maximum will be strict.

6.  SUGGESTIONS  FOR  FURTHER  WORK 

We  have  introduced  a  simple  algorithm  that  is  widely  applicable  to  a  large  class 
of  problems  involving  indirect  observations.  Clearly,  there  is  much  scope  to  supple¬ 
ment  our  fruitful  empirical  studies  by  further  theoretical  and  practical  work.  In  particu¬ 
lar,  the  work  of  Section  5  might  be  carried  further.  Once  this  is  done,  it  would  then 
be  of  interest  to  study  the  theoretical  properties  of  the  EMS  solution,  both  for  their 
own sake and in order, hopefully, to give insight into the choice of smoothing parameter.
Alternative smoothing schemes are also of interest.

The  whole  area  of  statistical  methods  for  indirect  data  is  not  enormously  well 
understood.  One  interesting  question  is  that  of  quantifying  the  information  loss 
inherent  in  the  indirectness  of  the  data.  Johnstone  and  Silverman  (1988)  have  made 
progress  for  the  PET  problem  towards  finding  an  equivalent  "direct  sample  size",  i.e. 
the  number  of  emissions  whose  exact  position  would  have  to  be  observed  to  give  the 
same  accuracy  of  estimation  as  the  given  sample  of  indirect  observations. 
FIGURE LEGENDS

Fig. 3.1. An EM reconstruction ( - ) of the truncated Weibull density ( - ).

Fig. 3.2. An EM reconstruction ( - ) of the truncated normal mixture density ( - ).

Fig. 3.3. An EMS reconstruction using 7 = 5 ( - ) of the truncated normal mixture
density ( - ). The reconstruction is based on the same data as Fig. 3.2.

Fig. 3.4. Another EMS reconstruction using 7 = 5 ( - ) of the truncated normal
mixture density ( - ). The reconstruction is based on a different dataset.

Fig. 3.5. An EMS reconstruction using 7 = 9 ( - ) of the truncated Weibull density
( - ). The reconstruction is based on the same data as Fig. 3.1.

Fig. 3.6. The EMS reconstruction using 7 = 5 ( - ) of the sphere radius intensity
for the mouse liver data.

Fig.  4.1.  A  planar  section  through  an  elliptical  "body"  within  a  circular  detector 
set;  edges  of  individual  detectors  are  marked.  An  emission,  at  *,  yields  a  randomly 
orientated  line  in  3-space.  Two  such  possible  lines  are  shown. 

Fig.  4.2.  Our  discretisation  of  the  disc. 

Fig.  4.3.  The  phantom. 

Fig. 4.4. A cross-section through a tube showing the distances defined in the text.
The annihilation spot is marked *.

Fig.  4.5.  Reconstruction  after  200  EM  iterations. 

Fig.  4.6.  Reconstruction  after  24  EM  iterations. 

Fig.  4.7.  Neighbours  in  the  circular  discretisation  scheme:  O  =  pixel  of  interest, 
* = first order neighbour, x = second order neighbour. There are eight different pixel
types  in  all;  two  of  these  are  illustrated  on  separate  insets. 

Fig.  4.8.  EMS  reconstruction  using  local  median  smoother. 

Fig. 4.9. EMS reconstruction using 2AVE smoother.

Fig.  4.10.  EMS  reconstruction  using  weighted  local  mean  smoother  with  W  =  200. 

Fig. 4.11. EMS reconstruction using weighted local mean smoother with W = 50.

Fig.  4.12.  EMS  reconstruction  using  weighted  local  mean  smoother  with  W  =  10. 
SUMMARY Positron emission tomography (PET) is an important medical imaging
technique.  Statistically,  the  PET  image  reconstruction  problem  comprises  estimating 
the  intensity  function  of  a  nonhomogeneous  Poisson  process  from  a  set  of  indirectly 
observed  data  (an  integral  transform  is  involved).  In  this  paper,  we  investigate  a  new 
reconstruction  method  consisting  in  the  adaptation  of  orthogonal  series  density  estima¬ 
tion  techniques  to  use  with  an  idealised  form  of  the  PET  problem.  The  method  pro¬ 
vides  reasonable  reconstructions  quickly;  its  computational  speed  is  its  major  advan¬ 
tage.  It  has  further  advantages  (e.g.  no  pixellation  required)  and  various  disadvan¬ 
tages  (e.g.  difficulties  with  object  boundaries,  non-negativity  not  guaranteed)  which 
are  discussed.  Its  major  disadvantage,  however,  is  the  difficulty  associated  with  gen¬ 
eralising  the  approach  to  cope  with  more  realistic  versions  of  the  PET  model. 

1 Introduction

It  is  often  desired  to  infer  something  about  the  internal  structure  of  an  object  when  to 
look  at  that  structure  directly  is  impossible.  Instead,  we  may  be  able  to  obtain  meas¬ 
urements  external  to  the  object  which  are,  in  some  way,  derived  from  the  internal 
structure  of  interest  and  from  which  we  might  hope  to  be  able  to  estimate  that  struc¬ 
ture.  This  scenario  occurs  frequently  in  medicine.  Suppose,  for  concreteness,  interest 
centres  on  a  patient’s  brain  and,  especially,  in  the  metabolic  activity  in  a  particular 
slice  through  the  brain.  An  idealised  image  illustrating  the  kind  of  pattern  of  activity 
we  might  expect  to  obtain  is  shown  in  Fig.  1.  Here,  grey  levels  are  used  to  represent 
different  levels  of  activity.  How  do  we  get  at  such  a  useful  portrait  of  unobservable 
features? 

The  particular  technique  for  this  kind  of  investigation  with  which  we  are  con¬ 
cerned  in  this  paper  is  positron  emission  tomography  (PET).  In  PET,  radioactive 
material  is  introduced  into  the  area  of  interest  —  often  tagged  glucose  in  the  brain  — 
with  the  idea  that  it  distributes  itself  around  in  direct  proportion  to  the  property  (meta¬ 
bolic  activity)  of  interest.  The  radioactive  tracer  emits  positrons,  each  of  which  in  turn 
creates  (in  conjunction  with  a  nearby  electron)  a  pair  of  X-ray  photons  which  fly  off  in 
opposite  directions  and  which  can  be  detected  externally;  the  point  of  photon  genera¬ 
tion  corresponding  to  a  typical  emission  is  marked  on  Fig.  1  by  a  circle,  together  with 
two  lines  through  the  point  representing  two  potential  photon  paths  which  in  fact  occur 
at  a  uniformly  distributed  random  angle.  An  array  of  detectors  positioned  around  the 
patient (in Fig. 1, they form a division of the outer circle into D = 128 arcs of equal
length) registers coincident photon arrivals. Thus, our data are the counts amassed in
the T = D(D-1)/2 = 8128 "tubes" defined by all pairs of detectors. In typical PET
applications,  the  total  emission  count  numbers  from  several  hundreds  of  thousands 
upwards.  The  present  and  potential  usefulness  of  PET  and  other  medical  tomographic 
techniques  is  considerable.  Research  interests  in  the  many  stages  that  make  up  a  com¬ 
plete  PET  system  cover  a  wide  variety  of  disciplines.  There  is  a  number  of  important 
statistical  questions  concerned  with  PET  of  which  just  the  most  obvious  one  of  best 
reconstructing  the  internal  image  from  the  external  observations  is  considered  here.  For 
a general introduction to PET, see Phelps, Mazziotta & Schelbert (1986); for more discussion
of the idealised PET setup in which we work here, see Section 2.

PET  therefore  provides  a  challenging  image  analysis  problem  which  differs  from 
many  image  analysis  problems  in  two  important  ways.  The  first  of  these  lies  in  the 
indirect nature of the image observation process described above. Many other problems,
such as those discussed in Besag (1986) for example, concern noisy direct observation,
in the sense that what is observed in each pixel depends only on the true
scene’s  value  in  that  pixel,  and  not  elsewhere,  together  with  some  modifying  noise 
process.  Here,  emissions  from  completely  different  areas  of  the  brain  contribute  to  the 
same  data  values  since  all  that  each  datum  registration  means  is  that  an  emission 
occurred  somewhere  in  the  given  tube.  In  fact,  observation  intensity  and  image  inten¬ 
sity  functions  are  related  by  an  integral  transform  given  in  Section  2.  The  second 
difference  between  PET  and  many  other  superficially  similar  problems  is  that  the 
image  of  interest  is  the  intensity  function  of  a  nonhomogeneous  Poisson  process  — 
emissions  occur  uniformly  throughout  areas  of  constant  activity  in  Fig.  1  but  with  rates 
differing  between  areas  in  direct  proportion  to  the  respective  activity  levels  —  and 
direct  data,  if  available,  would  be  a  realisation  of  that  Poisson  process;  this  contrasts 
with  data  which  are  values  of  some  true  regression-type  function  observed  with  error. 

There  are  several  popular  techniques  for  nonparametric  estimation  of  an  intensity 
function, or equivalently of a probability density function, available in the literature
(see  Silverman,  1986)  for  the  case  of  directly  observed  data.  Here,  we  investigate  the 
application  of  one  of  these  —  orthogonal  series  intensity  estimation  —  to  the  PET 
problem  concerning  indirect  observations.  It  turns  out  that  the  orthogonal  series 
approach  extends  easily  and  naturally  to  the  indirect  case,  at  least  for  one  particular 
idealisation  of  the  PET  reconstruction  problem;  details  are  given  in  Section  3. 

The  current  work  provides  a  practically  oriented  companion  paper  to  the  theoreti¬ 
cal  investigation  of  Johnstone  &  Silverman  (1988).  Johnstone  &  Silverman  were  con¬ 
cerned  with  quantifying  the  ill-posedness  of  the  PET  problem.  In  particular,  they  cal¬ 
culated  theoretically  the  order  of  magnitude  of  the  size  of  a  sample  of  directly 
observed  positron  emissions  that  would  be  required  to  be  equivalent  to  a  given  sample 
size of the indirectly observed data which is available in practice, in the sense of yielding
equally accurate image reconstructions. They conclude that their "results confirm
intuition that for the PET problem, the amount of information available is still substantial,
but it is by no means as great as if a sample of ... direct observations were available".
Johnstone & Silverman introduce the orthogonal series intensity estimation
method  as  a  purely  theoretical  device  to  aid  their  investigation.  They  mention  that  it 
"might  be  used  as  the  basis  for  practical  reconstructions".  Here,  we  follow  up  this 
suggestion. 

Various  properties  of  the  orthogonal  series  intensity  estimation  approach  to  PET 
image reconstruction are investigated in later sections of the paper. In Section 5, the
method  is  applied  to  a  simulated  example.  It  is  possible  to  understand  how  the  orthog¬ 
onal  series  smoothing  works  by  displaying  pictures  of  the  "equivalent  weight  function" 
which  a  weight  function  estimate  based  on  direct  observations  would  need  to  employ 
to  obtain  the  same  answers;  this  is  done  in  Section  6.  In  Section  7,  a  proposal  for  the 
automatic  choice  of  the  smoothing  parameter  associated  with  this  method  is  made. 

Broadly  speaking,  techniques  for  image  reconstruction  in  PET  fall  into  two 
categories.  On  the  one  hand,  the  best  quality  estimates  thus  far  available  derive  from 
iterative  algorithms  which  are  costly  in  terms  of  computer  time.  One  such  class  of 
methods  is  based  on  the  EM  algorithm,  as  developed  by  Vardi,  Shepp  &  Kaufman 
(1985),  for  which  much  recent  interest  has  centred  on  incorporating  some  kind  of 
smoothing  —  see  Silverman  et  al.  (1988)  for  our  own  contribution  to  this  area  and 
further  references.  On  the  other  hand,  practical  PET  implementations  tend  to  use  dif¬ 
ferent  algorithms  which  are  much  quicker  to  compute  but  sacrifice  something  in  terms 
of image accuracy. A favourite example of this type is the "convolution backprojection"
method described in, for example, Shepp & Kruskal (1978). We see the orthogonal
series intensity estimation approach as fitting more into the latter category although
the  quality  of  the  resulting  reconstructions  remains  fairly  good.  A  major  disadvantage 
of  the  proposed  method,  however,  is  the  difficulty  associated  with  generalising  the 
approach  to  cope  with  more  realistic  versions  of  the  PET  problem.  Further  discussion 
of  the  pros  and  cons  of  the  orthogonal  series  approach  is  given  in  the  closing  Section 
9. 

2  More  on  PET 

The  idealised  PET  setup  that  we  have  briefly  described  in  Section  1  is  the  one  dis¬ 
cussed  by  Vardi  et  al.  (1985)  in  a  paper  that  provides  an  excellent  introduction  to  the 
topic for the statistician. In practice, there are a number of potentially important factors
(such as time-of-flight considerations, attenuation problems, scattering and so on)
which are ignored in this model; they serve to modify the integral transform linking
points of emission and the data at hand and should, if possible, be incorporated in
practical  situations.  An  effect  due  to  the  nonzero  thickness  of  the  detectors,  first 
included  in  the  model  by  Silverman  et  al.  (1988),  is  also  omitted  in  this  paper,  but  see 
Section  8. 

Our  notation  follows  that  of  Johnstone  &  Silverman  (1988,  Section  2.2)  and  is 
briefly  reviewed  here.  We  first  consider  an  entirely  continuous  version  of  the  PET 
model: as well as the naturally continuous "brain space" (the unit disc), parametrised
by the usual polar coordinates $(r,\theta)$, $0 \le r \le 1$, $0 \le \theta < 2\pi$, suppose the "detector space"
consists not of the T tubes of reality, but is the space of all possible unordered pairs of
points on the unit circle. Parametrise detector space in a polar fashion too: elements of
this space are given by $(s,\varphi)$, $0 \le s \le 1$, $0 \le \varphi < 2\pi$, where s is the length of the perpendicular
from the origin to the detected line and $\varphi$ is the orientation of that perpendicular
(see Fig. 2.2 of Johnstone & Silverman, 1988). It is convenient to renormalise the
emission intensity to be a probability density function $f(r,\theta)$, say, with respect to normalised
Lebesgue measure $\mu$, where $d\mu(r,\theta) = \pi^{-1} r\,dr\,d\theta$. Write $g(s,\varphi) = (Pf)(s,\varphi)$
for the probability density in detector space with respect to the transformed measure $\lambda$
given by $d\lambda(s,\varphi) = 2\pi^{-2}(1-s^2)^{1/2}\,ds\,d\varphi$. The mapping P is the well-known Radon
transform of the density f given by

$$(Pf)(s,\varphi) = \tfrac{1}{2}(1-s^2)^{-1/2}\int_{-\sqrt{1-s^2}}^{\sqrt{1-s^2}} f(s\cos\varphi - t\sin\varphi,\; s\sin\varphi + t\cos\varphi)\,dt. \qquad (1)$$

As is intuitively clear, the Radon transform represents the average value of f over the
line  connecting  the  pair  of  points  on  the  circle.  See  Johnstone  &  Silverman  (1988)  for 
more  details  of  the  above  and  Deans  (1983)  for  a  good  introduction  to  the  Radon 
transform  in  general. 
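
The chord-average interpretation of (1) can be checked numerically. The following sketch is illustrative only: it assumes f is supplied as a function of Cartesian coordinates (x, y), as the arguments in (1) suggest, and it uses Gauss-Legendre quadrature along the chord; the function name is hypothetical.

```python
import numpy as np

def radon_chord_average(f, s, phi, n_nodes=64):
    """Evaluate (Pf)(s, phi) in (1): the average of f along the chord at (s, phi).

    f takes Cartesian coordinates (x, y); s in [0, 1], phi in [0, 2*pi).
    """
    half_len = np.sqrt(1.0 - s**2)                         # chord runs over |t| <= sqrt(1 - s^2)
    u, wts = np.polynomial.legendre.leggauss(n_nodes)      # nodes and weights on [-1, 1]
    t = half_len * u
    x = s * np.cos(phi) - t * np.sin(phi)
    y = s * np.sin(phi) + t * np.cos(phi)
    # (1 / (2 * half_len)) * integral over [-half_len, half_len] equals 0.5 * sum(w_i * f_i)
    return 0.5 * np.sum(wts * f(x, y))

if __name__ == "__main__":
    uniform = lambda x, y: np.ones_like(x)                 # uniform density on the disc
    dome = lambda x, y: 2.0 * (1.0 - x**2 - y**2)          # integrates to 1 w.r.t. mu
    print(radon_chord_average(uniform, 0.3, 1.0))
    print(radon_chord_average(dome, 0.3, 1.0))
```

For the uniform density the transform is identically one, and for the density $2(1-r^2)$ a direct calculation gives $(Pf)(s,\varphi) = \tfrac{4}{3}(1-s^2)$, which provides a simple test case.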

In  reconstructing  PET  images  in  this  paper,  we  maintain  the  continuous  nature  of 
brain  space  but  are  forced  to  discretise  detector  space.  The  former  continuity  contrasts 
with  many  other  reconstruction  methods  (including  those  of  Vardi  et  al.,  1985,  and 
Silverman  et  al.,  1988)  which  work  with  a  discrete  pixellation  of  the  disc.  The  latter 
discretisation  of  detector  space  is  an  irremovable  constraint  due  to  the  physical  setup. 
We denote the corresponding discrete tubecounts by $n_t$, $t = 1, \dots, T$, where the order of
indexing  tubes  by  t  is  immaterial. 
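
As an illustration of the discretised detector space, the sketch below lists the T = D(D-1)/2 tubes for D equal detector arcs and attaches to each a representative line in the polar parametrisation $(s,\varphi)$. It assumes, as a simplification not stated in the text, that the representative line of a tube is the chord joining the midpoints of its two detector arcs; the function name is hypothetical.

```python
import numpy as np

def tube_centre_lines(D):
    """Return (s, phi) for the chords joining the arc midpoints of every detector pair.

    Detector k is taken to occupy the arc centred at angle 2*pi*(k + 0.5)/D.
    """
    alpha = 2.0 * np.pi * (np.arange(D) + 0.5) / D
    i, j = np.triu_indices(D, k=1)                 # all unordered pairs i < j
    half_gap = 0.5 * (alpha[j] - alpha[i])         # lies in (0, pi)
    mid = 0.5 * (alpha[j] + alpha[i])
    c = np.cos(half_gap)                           # signed distance of the chord from the origin
    s = np.abs(c)
    # when c < 0 the foot of the perpendicular lies in the opposite direction
    phi = np.where(c >= 0, mid, mid + np.pi) % (2.0 * np.pi)
    return s, phi

if __name__ == "__main__":
    s, phi = tube_centre_lines(128)
    print(len(s))                                  # number of tubes
```

The chord midpoint formula used here follows from $(e^{i\alpha} + e^{i\beta})/2 = \cos\{(\beta-\alpha)/2\}\,e^{i(\alpha+\beta)/2}$ for two points on the unit circle.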

3  Appropriate  Orthogonal  Series  Estimation 

We wish to estimate the emission intensity f. If direct observations drawn from f were
available, the usual orthogonal series estimation paradigm is as follows. Firstly, expand
f in terms of orthonormal functions $\{\eta_\nu\}$, i.e. write

$$f(r,\theta) = \sum_\nu f_\nu \eta_\nu(r,\theta).$$

Secondly, estimate the coefficients $\{f_\nu\}$ by the average of $\eta_\nu$ over the sample of
$(r,\theta)$'s; call these $\{\hat f_\nu\}$. Finally, introduce some smoothing either by a collection of
tapering weights or, as here, by cutting off the potentially infinite sum after some finite
number, K, of terms. Our estimate is then

$$\hat f(r,\theta) = \sum_{\nu \le K} \hat f_\nu \eta_\nu(r,\theta). \qquad (2)$$

See Section 2.7 of Silverman (1986) for an account of this approach to density estimation
and Section 7 of Izenman (1988) for more references. Since our f is a bivariate
function, $\nu$ represents a double subscript.

We can equally well expand g as

$$g(s,\varphi) = \sum_\nu g_\nu \psi_\nu(s,\varphi)$$

for appropriate functions $\{\psi_\nu\}$ and use a similar procedure to estimate g. Note that the
$\hat g_\nu$'s are practically calculable from our indirect data. Now, provided that the orthonormal
sets $\{\eta_\nu\}$ and $\{\psi_\nu\}$ are such that there exists a set $\{b_\nu\}$ of positive real numbers
with

$$(P\eta_\nu)(s,\varphi) = b_\nu \psi_\nu(s,\varphi), \qquad (3)$$

we can write $g_\nu = b_\nu f_\nu$ so that

$$f(r,\theta) = \sum_\nu b_\nu^{-1} g_\nu \eta_\nu(r,\theta).$$

The natural orthogonal series estimate of f based on indirect observations is therefore

$$\hat f(r,\theta) = \sum_{\nu \le K} b_\nu^{-1} \hat g_\nu \eta_\nu(r,\theta). \qquad (4)$$

The  fact  that  a  set  of  quantities  with  the  above  properties  —  a  singular  value 
decomposition  —  exists  for  the  Radon  transform  (see  Deans,  1983,  Section  7.6)  is 
what  makes  the  orthogonal  series  intensity  estimation  approach  applicable  to  our  ideal¬ 
ised  PET  model.  In  brain  space,  the  appropriate  orthonormal  functions  are 

$$\eta_\nu(r,\theta) = (m+1)^{1/2} Z_m^{|l|}(r)\,e^{il\theta}. \qquad (5)$$

We have written $\nu$ as (m, l); m = 0, 1, 2, ... is what becomes truncated at K, while l
varies from -m to m in steps of 2. The functions $Z_m^{|l|}(r)$ are the Zernike polynomials of
degree m and order l which have a history of application in optics (Born & Wolf,
1980). See Deans (1983, Section 7.6) for their properties. In detector space, we take

$$\psi_\nu(s,\varphi) = U_m(s)\,e^{il\varphi} \qquad (6)$$

where $U_m(s)$ is a Chebyshev polynomial of the second kind (see Deans, 1983, Appendix
C). The singular values $\{b_\nu\}$ are given very simply by

$$b_\nu = (m+1)^{-1/2}. \qquad (7)$$

Of course, the orthonormal functions in (5) and (6) are real-valued. In each, the quantity
of the form $e^{il\theta}$ is simply a useful shorthand for coping with sine and cosine
terms; in appropriate combination, all imaginary terms disappear. More details on the
above development can be found in Sections 5 and 6 of Johnstone & Silverman
(1988).

The  discrete  nature  of  the  tubecount  data  affects  the  estimates  [§v)  of  {#„}. 
Suppose  the  line  parallel  to  the  sides  of  tube  t  but  located  at  its  centre  has  coordinates 
( st,9t ).  Then  we  use  the  natural  sample  average  of  yv  based  on  the  grouped  data, 
namely 

L  =  N~l '£ntyv(st,<pl f),  (8) 

r=l 

the bar denoting complex conjugation. Here, $N = \sum_t n_t$ is the total number of emissions.
Plugging (7), (8) and the definitions (5) and (6) into (4) yields a complete
description of $\hat f$ (for fixed $K$). Recursions involved in calculating both types of orthogonal
polynomial help to keep the computational burden down.
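The assembly of (4) from (5) to (8) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' program: the list `tubes` of triples (n_t, s_t, phi_t) and all function names are hypothetical, the normalising measures on brain and detector space are taken to be those defined earlier in the paper, and the radial Zernike polynomials are evaluated from the standard explicit factorial sum rather than by a recursion (the explicit sum is simple but can lose accuracy at high degree).

```python
import cmath
import math


def chebyshev_u(m_max, s):
    """Return [U_0(s), ..., U_{m_max}(s)] via U_{m+1} = 2 s U_m - U_{m-1}."""
    values = [1.0, 2.0 * s]
    for m in range(1, m_max):
        values.append(2.0 * s * values[m] - values[m - 1])
    return values[: m_max + 1]


def zernike_radial(m, l, r):
    """Radial Zernike polynomial of degree m and order |l| (explicit sum)."""
    l = abs(l)
    total = 0.0
    for k in range((m - l) // 2 + 1):
        coeff = ((-1) ** k * math.factorial(m - k)
                 / (math.factorial(k)
                    * math.factorial((m + l) // 2 - k)
                    * math.factorial((m - l) // 2 - k)))
        total += coeff * r ** (m - 2 * k)
    return total


def series_estimate(K, tubes, r, theta):
    """Evaluate the estimate (4) at the brain-space point (r, theta).

    tubes is a list of (n_t, s_t, phi_t): count and centre-line coordinates.
    """
    total_count = sum(n for n, _, _ in tubes)
    cheb = [chebyshev_u(K, s) for _, s, _ in tubes]
    f_hat = 0j
    for m in range(K + 1):
        for l in range(-m, m + 1, 2):
            # Equation (8): grouped-data average of the conjugate of psi_nu.
            g_hat = sum(n * u[m] * cmath.exp(-1j * l * phi)
                        for (n, _, phi), u in zip(tubes, cheb)) / total_count
            # Equation (5): brain-space function.
            eta = (math.sqrt(m + 1) * zernike_radial(m, l, r)
                   * cmath.exp(1j * l * theta))
            # Equation (7): 1 / b_nu = (m + 1) ** 0.5.
            f_hat += math.sqrt(m + 1) * g_hat * eta
    return f_hat
```

The returned value is complex only in form: the terms for l and -l are complex conjugates of one another, so the imaginary part vanishes up to rounding error and the real part is the estimate.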

4  Presentation  of  Figures 

Figs 2 to 8 are all grey level images of PET image reconstructions and related quantities.
Each uses 32 grey levels scaled in a rather arbitrary way, increases in darkness
representing  increases  in  (estimates  of)  metabolic  activity.  Orthogonal  series  intensity 
estimation results in (high order) polynomial surfaces defined at all points of the disc.
Representing  such  smooth  functions  is  a  task  well  suited  to  the  application  of  a  high 
quality  contouring  package;  in  our  figures,  we  have  used  the  excellent  CONICON3 
programs  of  Sibson  (1987).  The  grey  level  images  result  from  suppressing  drawing  of 
the  contours  themselves  and  filling  in  the  areas  between  successive  contours  with 
appropriate  shades  of  grey.  CONICON3  requires  value  and  gradient  information  on  the 
function to be contoured only at a regular grid of values; a 20 x 20 square grid usually
sufficed here. Computation and presentation of the images given in this paper
were  performed  on  a  SUN  3/160  workstation,  copies  of  the  pictures  being  produced  by 
an  Apple  LaserWriter  II  printer. 

5  A  Simulated  Example 

We illustrate use of the orthogonal series intensity estimation algorithm on data simulated
from the image (the "phantom") shown in Fig. 1. This phantom is a piecewise
constant function made up of elliptical areas of constant intensity (representing
ventricles, tumours and so on) on a large background ellipse (the head). The key
property of this idealisation which, we believe, transfers to real images is the presence
of edges of features at which there may be a considerable jump in intensity; ideally,
we would like to estimate such edges well. The constancy property (within objects)
may prove less realistic than some kind of smooth variation, but this is less of an
issue. This phantom is essentially the same as that of Fig. 2 of Vardi et al. (1985) and
was also used in Silverman et al. (1988). Fig. 1 has something of a discretised look
about it, having been obtained by using CONICON3 on a fine 100 x 100 grid; this
comes about since we are applying CONICON3 to an entirely inappropriate piecewise
constant function! Nonetheless, Fig. 1 bears comparison with the discretised version of
the phantom given as Fig. 4.3 of Silverman et al. (1988), giving a good impression of
the features present in the image and serving as a kind of bound on how well the true
phantom could be reconstructed using the representation tools at hand. A total of
$N = 10^6$ emissions, commensurate with real applications, was generated from this
intensity function using the acceptance/rejection method in the obvious way. The
corresponding tubecounts form the data for this experiment.
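As an illustration of the acceptance/rejection step, the following Python sketch draws emission points from a piecewise constant intensity built from ellipses. The ellipse parameters are invented for the example and are not those of the phantom in Fig. 1; the parametrisation of a photon line by a uniform angle phi and signed distance s = x cos(phi) + y sin(phi) is likewise an assumption, and the final grouping of lines into detector tubes is omitted.

```python
import math
import random

# Hypothetical phantom: (centre_x, centre_y, semi_axis_a, semi_axis_b, increment).
# Intensities add where ellipses overlap; these values are illustrative only.
ELLIPSES = [
    (0.0, 0.0, 0.8, 0.9, 1.0),     # background ellipse (the head)
    (0.3, 0.2, 0.15, 0.1, 2.0),    # a brighter feature
    (-0.3, -0.1, 0.2, 0.25, -0.5), # a darker feature
]
F_MAX = 3.0  # an upper bound on the intensity defined above


def intensity(x, y):
    """Piecewise constant intensity: sum of increments of ellipses containing (x, y)."""
    total = 0.0
    for cx, cy, a, b, increment in ELLIPSES:
        if ((x - cx) / a) ** 2 + ((y - cy) / b) ** 2 <= 1.0:
            total += increment
    return total


def sample_emissions(n, rng):
    """Draw n emission points with density proportional to the intensity."""
    points = []
    while len(points) < n:
        # Propose uniformly on the bounding square, accept with prob f / F_MAX.
        x, y = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        if rng.uniform(0.0, F_MAX) < intensity(x, y):
            points.append((x, y))
    return points


def photon_line(point, rng):
    """Attach a uniformly oriented line through the emission point; return (s, phi)."""
    x, y = point
    phi = rng.uniform(0.0, math.pi)
    return x * math.cos(phi) + y * math.sin(phi), phi
```

For example, `sample_emissions(1000, random.Random(0))` returns 1000 points, none of which lies outside the background ellipse because the intensity there is zero.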

Figs  2  to  4  are  three  reconstructions  obtained  from  these  data;  they  correspond  to 
K  =  10,  36  and  50,  respectively.  The  first  (Fig.  2)  is  clearly  oversmooth.  It  is 
encouraging  that  even  here  large  features  present  in  the  phantom  are  reproduced  to 
some  extent  but  the  total  disappearance  of  the  smaller  objects  gives  cause  for  concern. 
Figs 3 and 4 are progressively less smooth. By the time $K = 50$ (Fig. 4) it can be
argued, given knowledge of the true image, that even the smaller features are indicated
fairly well but, of course, that (practically unobtainable) knowledge is required to differentiate
the small objects on the reconstruction that should be there from the others
such an undersmoothed reconstruction gives that should not. On balance, the choice
$K = 36$ (Fig. 3) seems to be about as good as we can get. Large features are well
represented;  there  can  be  rather  less  confidence,  though,  in  the  smaller  structure.  Of 
course,  the  smooth  polynomial  nature  of  our  reconstruction  method  is  a  drawback 
when,  as  here,  piecewise  smooth  areas  with  considerable  discontinuities  in  value  at 
feature  boundaries  make  up  the  true  image.  The  reader  is  left  to  append  his  or  her  own 
adjectives to the goodness or otherwise of Fig. 3 as an approximation to Fig. 1!

White  areas  in  Figs  2  to  4  are  below  the  zero  contour.  The  presence  of  such 
negativity  in  our  reconstructions  is  a  property  of  the  method  that  may  be  felt  to  be 
undesirable;  we  note,  at  least,  that  negativity  occurs  in  these  figures  only  outside  the 
head  region  where  there  are  no  emissions  in  reality.  Towards  the  outside  of  brain 
space,  some  increase  in  estimated  intensity  levels  is  an  edge  effect  which  should  be 
ignored. 
6 Equivalent Weight Functions

Many  density  /  intensity  estimation  methods  can  be  written  in  the  form  of  general 
weight  function  estimators  (e.g.  Silverman,  1986,  Section  2.9).  In  the  usual  case  where 
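$\hat f$ is obtained from direct observations $\{(r_i,\theta_i)\}$ as in (2), we can write

$$\hat f(r,\theta) = N^{-1} \sum_{i=1}^{N} w\bigl((r_i,\theta_i),(r,\theta)\bigr) \qquad (9)$$

where the weight function $w$ is given by

$$w\bigl((R,\Theta),(r,\theta)\bigr) = \sum_{\nu \le K} \bar\eta_\nu(R,\Theta)\, \eta_\nu(r,\theta). \qquad (10)$$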

When  such  direct  observations  are  available  from  the  density  of  interest,  the  weight 
function expresses how a particular observation is smoothed out in making its contribution
to the overall estimate and hence gives insight into the nature of the smoothing
process; see, for example, Silverman (1984) for another relevant context.

Since the PET observation process is an indirect one, some modification of the
above discussion is necessary. An appropriate alternative definition of the weight function,
equivalent in the case of direct sampling, is as an "impulse response function".
That is, suppose that the true image consisted of a point mass at $(R,\Theta)$ and that $N$
indirect observations from this image were taken. Then, ignoring the tubecount discretisation
and with the degree of smoothing held fixed, it is easily shown that $\hat f(r,\theta)$
based on these data approaches $w$ in (10) as $N \to \infty$. Thus, $w$ remains the appropriate
weight function to study in the case of indirectly observed data too. As $w$ and an alternative
version of $w$ which properly takes the data discretisation into account are virtually
indistinguishable, we have not incorporated the data discretisation modification
here.

To  make  more  of  the  above  we  present  some  pictorial  illustrations.  Fig.  5  shows 
$w$ for $R = 0$ and $\Theta = 0$. Again, grey scale images are used in an obvious way (although
the  overall  scaling  of  the  pictures  in  this  section  differs  from  that  in  Section  5).  White 
areas  again  define  regions  of  negativity.  The  main  features  of  Fig.  5  are  the  spherical 
symmetry  of  the  weight  function  and  its  concentration  about  the  point  (0,0).  As  in  the 
familiar kernel estimation approach, $w$ has a mode at the point of interest and falls
smoothly  away,  resulting  in  an  averaging  over  neighbouring  values  whose  influence 
becomes  less  as  their  distance  from  the  centre  increases.  Beyond  this  central  area,  w  is 
small  but  not  always  positive;  rather,  there  is  a  smooth  fluctuation  about  zero,  resulting 
in  a  series  of  low  positive  peaks  and  shallow  negative  troughs.  In  Fig.  5,  we  have 
taken  K  =  10.  Larger  values  of  K  smooth  less  by  narrowing  the  scope  of  the  main  part 
of  the  weight  function  and  thus  averaging  significantly  over  fewer  neighbouring  points. 

The choice $\Theta = 0$ in Fig. 5 is quite general; $w$ is rotation equivariant so other $\Theta$'s
result simply in rotations of the $\Theta = 0$ picture. Different $R$'s are worthy of further
consideration, though: in Fig. 6, we take $R = 0.5$ and in Fig. 7, $R = 0.9$. The general
pattern of a peak at the point of interest, a smooth falling away of $w$ in a neighbourhood
of the point and the small positive/negative fluctuations in the tails persist. The
weight function is, of course, no longer essentially spherically symmetric but rather is
distorted somewhat in a way consistent with fitting $w$ appropriately into the disc. What
is important about Figs 5 to 7 is that the amount of smoothing (essentially the extent
and shape of the area in which $w$ is significantly nonzero) does not differ greatly at
different points in brain space. Varying degrees of smoothing in response to properties
of $f$ is an option (not considered here) that may well be desirable; varying degrees of
smoothing purely as a geometric function is not.
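The rotation equivariance can be checked directly. As a short worked step, assuming the brain-space functions have the product form of (5), that is a real radial factor $(m+1)^{1/2} Z_m^{|l|}(r)$ times $e^{il\theta}$, substitution into (10) gives

$$w\bigl((R,\Theta),(r,\theta)\bigr) = \sum_{m \le K} (m+1) \sum_{l} Z_m^{|l|}(R)\, Z_m^{|l|}(r)\, e^{il(\theta-\Theta)} = \sum_{m \le K} (m+1) \sum_{l} Z_m^{|l|}(R)\, Z_m^{|l|}(r) \cos\{l(\theta-\Theta)\},$$

where $l$ runs from $-m$ to $m$ in steps of 2 and the second equality follows by pairing the terms for $l$ and $-l$. The angles enter only through $\theta - \Theta$, so changing $\Theta$ rotates the picture, and the imaginary parts cancel. At $R = 0$ only the $l = 0$ terms survive, because the radial Zernike polynomials of nonzero order vanish at the origin; $w$ then depends on $r$ alone, which is the symmetry about the centre seen in Fig. 5.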

7  Automatic  Choice  of  Smoothing  Parameter 

We  saw  in  Section  5  how  the  parameter  K  controls  the  level  of  smoothing  applied  to 
the  data.  Subjective  choice  of  smoothing  parameter,  as  there,  is  sufficient  in  many 
applications  of  smoothing  techniques  but,  in  PET  imaging,  a  fully  automatic  procedure, 
and  thus  an  automatic  method  for  choosing  K  appropriately,  might  well  be  thought 
desirable.  In  this  section,  we  illustrate  how  a  rather  natural  approach  to  choosing  the 
smoothing  parameter  in  orthogonal  series  density  estimation  in  general  adapts  to  the 
PET  case. 

Suppose we consider the mean integrated squared error (MISE) to be an appropriate
measure of discrepancy between $\hat f$ and $f$. The following development is entirely
analogous to the Fourier series density estimation case worked out in Hart (1985) and
references therein. First note that

$$\iint |\hat f(r,\theta) - f(r,\theta)|^2 \, d\mu(r,\theta) = \sum_{\nu \le K} b_\nu^{-2} |\hat g_\nu - g_\nu|^2 + \sum_{\nu > K} b_\nu^{-2} |g_\nu|^2.$$

Now ignore the tube discretisation for the moment (i.e. define $\hat g_\nu$ like $\hat f_\nu$ in (2) rather
than by using (8)), so that $E(\hat g_\nu) = g_\nu$ and $\mathrm{Var}(\hat g_\nu) = N^{-1}\sigma_\nu^2$ where $\sigma_\nu^2 = \mathrm{Var}(\psi_\nu(S,\Phi))$.
Taking expectations in the above expression, we get

$$\mathrm{MISE} = \sum_{\nu \le K} b_\nu^{-2} \bigl(N^{-1}\sigma_\nu^2 - |g_\nu|^2\bigr) + \iint |f(r,\theta)|^2 \, d\mu(r,\theta). \qquad (11)$$

The value of $K$ that minimises this MISE is a candidate for being a good choice of $K$
for the PET problem. Of course, we do not know MISE or its optimal $K$. Rather, we
drop the second term in (11) from further consideration because it is independent of $K$
and estimate the first term, $I$, say, as best we can; choosing $K$ to minimise this estimate
($\hat I$) yields a practical procedure which, it is hoped, comes close to using the truly
optimal value of $K$. It is not difficult to show that

$$\hat I = \sum_{\nu \le K} b_\nu^{-2} (N-1)^{-1} \bigl\{ 2 s_\nu - (N+1) |\hat g_\nu|^2 \bigr\} \qquad (12)$$

is an unbiased estimate of $I$, so it is this formula that we minimise. In (12), $s_\nu$ is the
sample average of the $|\psi_\nu|^2$'s. In practice, we are stuck with the discretisation of
detector space, so we use

$$s_\nu = N^{-1} \sum_{t} n_t\, |\psi_\nu(s_t,\phi_t)|^2$$

and $\hat g_\nu$ as in (8) in $\hat I$.

Since $K$ is an integer, it is straightforward to minimise (12) by evaluating it over
a range of values of $K$; $(D - 1)$ is an upper bound to this range due to an aliasing
effect although the optimal $K$ is most likely to be much less than this anyway. Doing
this for the simulated example yields $K = 36$ and the corresponding figure is Fig. 3.
This is the image we preferred in Section 5 on subjective grounds. Of course, on the
basis of this one example only we make no great claims for the supremacy of our
automatic procedure. For one thing, there is always scope for wayward choices due to
errors in estimating the MISE-optimal $K$. More importantly, the propriety or otherwise
of MISE as risk function is in question. It is widely recognised that this type of measure
does not give a good reflection of the human observer's sense of image fidelity
especially when, as here, the true image contains features with distinct edges. The provision
of image metrics that properly reflect visual perception remains a difficult question;
see Baddeley (1987) for some ideas. We persevered with the MISE development
above largely on grounds of tractability but are encouraged by the results: it is to be
hoped that alternative image metrics would also be open to a similar kind of approach.
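The search over $K$ is cheap because (12) is a cumulative sum over the degree $m$. A minimal Python sketch follows, using the grouped-data versions of $\hat g_\nu$ and $s_\nu$ and the singular values $b_\nu^{-2} = m + 1$; the list `tubes` of triples (n_t, s_t, phi_t) and the function names are hypothetical, and the caller is assumed to pass `K_max` no larger than $D - 1$.

```python
import cmath


def chebyshev_u(m_max, s):
    """Return [U_0(s), ..., U_{m_max}(s)] via U_{m+1} = 2 s U_m - U_{m-1}."""
    values = [1.0, 2.0 * s]
    for m in range(1, m_max):
        values.append(2.0 * s * values[m] - values[m - 1])
    return values[: m_max + 1]


def criterion_values(tubes, K_max):
    """Return the estimate (12) for K = 0, 1, ..., K_max as a list.

    tubes is a list of (n_t, s_t, phi_t); the total count N must exceed 1.
    """
    N = sum(n for n, _, _ in tubes)
    cheb = [chebyshev_u(K_max, s) for _, s, _ in tubes]
    values, running = [], 0.0
    for m in range(K_max + 1):
        # |psi_nu|^2 = U_m(s)^2 does not depend on l, so s_nu is shared.
        s_nu = sum(n * u[m] ** 2 for (n, _, _), u in zip(tubes, cheb)) / N
        for l in range(-m, m + 1, 2):
            g_hat = sum(n * u[m] * cmath.exp(-1j * l * phi)
                        for (n, _, phi), u in zip(tubes, cheb)) / N
            # One term of (12), with 1 / b_nu^2 = m + 1.
            running += (m + 1) * (2.0 * s_nu - (N + 1) * abs(g_hat) ** 2) / (N - 1)
        values.append(running)
    return values


def choose_K(tubes, K_max):
    """Return the K in 0..K_max that minimises the estimated criterion."""
    values = criterion_values(tubes, K_max)
    return min(range(len(values)), key=values.__getitem__)
```

Keeping the whole list from `criterion_values` rather than only the minimiser makes it easy to plot the criterion against $K$ and to see how flat it is near the minimum.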

Replacing the simple cutoff $K$ in (4) by a sequence of weights $\{w_\nu\}$ remains an
alternative option but is one with similar problems of smoothing parameter choice.
Johnstone & Silverman (1988, Section 7) discuss optimal weight sequences for MISE;
these are, as is to be expected, not immediately practicable because they depend on the
true $f$. Otherwise, we might experiment with ad hoc weight sequences; the formulae in
Wahba (1981) become one possibility. These have not been pursued here.

8  The  Third  Dimension  Effect 

Photon  lines  are  in  reality  distributed  uniformly  in  3-dimensional  space,  not  just  in  the 
plane,  and  detectors  have  a  finite  depth,  d.  This  effect  of  the  third  dimension  is  not 
incorporated into the reconstruction scheme above although it is important because it
persists even when $d \to 0$, i.e. our 2-dimensional model differs from the limit of the
3-dimensional one. It turns out that, to a good approximation, the third dimension effect
results  in  a  weighted  Radon  transform,  the  weight  factor  being  inversely  proportional 
to  the  length  of  the  detector  tube  (or,  at  least,  its  continuous  analogue);  see  Section 
4.1.4  of  Silverman  et  al.  (1988)  and  Section  10.2  of  Johnstone  &  Silverman  (1988)  for 
details. 




We  have  not  yet  managed  to  modify  the  orthogonal  series  estimation  approach  to 
cope  with  this.  Rather,  here  we  demonstrate  the  considerable  effect  that  failure  to  do 
so has on quality of image reconstruction. We can easily simulate data from the phantom
of Fig. 1 taking the third dimension into account by adding a further
acceptance  /  rejection  step  to  deal  with  the  inverse  length  bias;  in  fact,  the  resulting 
dataset  is  precisely  that  used  by  Silverman  et  al.  (1988)  in  their  simulation  example. 
Applying  the  current  (2-dimensional)  reconstruction  algorithm  (here  with  K  =  36)  to 
these  (3-dimensional)  data  gives  Fig.  8;  compare  this  with  Fig.  3  in  particular.  The 
third dimension effect on the data is clear: a smaller proportion of emissions occurring
towards  the  centre  of  the  brain  space  will  be  detected  than  of  those  occurring  nearer  to 
the  edge.  The  consequences  for  the  2-dimensional  reconstruction  are  equally  clear: 
greater  intensities  are  attributed  to  outer  regions  than  should  be  the  case,  while  central 
areas  suffer  the  reverse  mistake. 

9  Discussion 

That the orthogonal series intensity estimation approach to PET image reconstruction is
quick  compared  with  iterative  procedures  is  borne  out  by  the  approximately  30-fold 
improvement  in  computer  time  we  have  observed  in  comparison  with  the  best  EMS 
procedure  of  Silverman  et  al.  (1988).  That  it  also  suffers  in  comparison  in  terms  of 
important  image  quality  criteria  is  also  evident  in  at  least  three  major  ways: 

(I) The smoothness of images made up of polynomials is not consistent with the
presence of edges which, we argued in Section 5, are most likely to be an important
feature of the real images we set out to reconstruct.

(II)  There  cannot  be  areas  of  the  brain  emitting  negative  numbers  of  positrons!  EM 
(Vardi  et  al.,  1985)  and  EMS  (Silverman  et  al.,  1988)  algorithms  naturally  result 
in  non-negative  reconstructions;  as  we  have  seen,  the  orthogonal  series  approach 
does  not. 

(III) The Zernike/Chebyshev polynomial based approach is appropriate only to direct
and indirect observation spaces being linked by the basic Radon transform. Early
in Section 2 we noted the many modifications to this transform that are needed to
properly model the practical situation. The major obstacle to use of the orthogonal
series approach in more realistic circumstances is the need to obtain the
singular value decomposition associated with the correct integral transform. Note
that for EM-based approaches, it is only necessary (for many modifications) to
identify the right transform and to discretise it to get the $p(b,d)$'s of Vardi et al.
(1985).

The  orthogonal  series  approach  has  further  advantages  as  well  as  disadvantages. 
(i) It is straightforward to understand in the sense that it is a fairly direct application
of  a  well-known  technique. 

(ii) That there is no need to discretise brain space to facilitate reconstruction is particularly
nice; the truly continuous nature of orthogonal series reconstruction is, conceptually,
most appealing.
FIGURE LEGENDS


Fig. 1. An idealised PET image within a circular array of detectors. Two possible
photon lines arising from an emission at O are superimposed.

Fig. 2. Reconstruction with $K = 10$.

Fig. 3. Reconstruction with $K = 36$.

Fig. 4. Reconstruction with $K = 50$.

Fig. 5. An equivalent weight function corresponding to $R = 0$.

Fig. 6. An equivalent weight function corresponding to $R = 0.5$.

Fig. 7. An equivalent weight function corresponding to $R = 0.9$.

Fig. 8. Reconstruction ($K = 36$) arising from data incorporating the third dimension
effect.
Aggregation and refinement in binary image restoration.


by 


M.  Jubb  and  C.  Jennison 


1.  Introduction 

Recent  developments  in  statistical  image  restoration  use  a  Bayesian  approach. 
One  observes  a  degraded  version  of  a  true  scene  after  the  addition  of  noise  and, 
possibly,  blurring.  If  the  degradation  process  and  noise  distribution  are  known,  the 
likelihood  of  the  record  can  be  combined  with  a  prior  probability  model  to  produce  a 
posterior  distribution  for  the  true  scene.  A  common  approach  is  then  to  seek  the 
maximum  a  posteriori  (MAP)  estimate  of  the  scene  and  present  this  as  the  restored 
image. 

For  computational  purposes  it  is  extremely  convenient  to  work  with  Markov 
random  field  (MRF)  models.  Under  a  MRF  model  the  scene  is  divided  into  pixels, 
each of which can take a single colour or grey level; a neighbourhood structure for the
pixels  is  specified  and  the  key  property  of  the  model  is  that  the  distribution  of  the 
colouring  of  any  pixel  is  conditionally  independent  of  all  other  pixels,  given  the 
colouring  of  its  neighbours. 

There  are  two  main  approaches  to  searching  for  the  MAP  estimate.  Geman  & 
Geman  (1984)  proposed  the  method  of  simulated  annealing.  They  have  shown  this  to 
be  a  versatile  and  effective  method  although  the  amount  of  computation  involved  is 
often  high.  Besag  (1986)  suggested  a  computationally  simpler  method  which  he  refers 
to  as  the  method  of  iterated  conditional  modes  (ICM).  This  method  will  normally 
converge  to  a  local  rather  than  global  maximum  of  the  a  posteriori  likelihood; 
however,  convergence  is  rapid  and,  given  the  approximate  nature  of  the  MRF  model, 
failure  to  find  the  global  maximum  may  not  be  a  serious  drawback. 

Jennison  (1986)  and  Jennison  &  Jubb  (1987)  have  shown  that  the  same  form  of 
MRF  model  can  be  used  to  obtain  restorations  of  an  image  with  detail  at  a  finer  level 
than  the  pixel  grid  on  which  records  are  observed.  In  their  original  examples  the  noise 
level  was  very  low.  The  work  reported  in  this  paper  grew  out  of  an  investigation  into 
the  use  of  "refinement"  methods  in  the  presence  of  greater  noise:  the  main  problem  in 
this  case  is  to  find  a  good  starting  point  for  the  refinement  algorithm.  In  some  of  our 
exploratory  examples  we  discovered  that  the  ICM  method  itself  experienced  serious 
difficulties  at  very  high  noise  levels.  One  solution  to  this  problem  is  to  increase  the 
signal to noise ratio by aggregating the records of, say, each 2 by 2 block of pixels
into  a  single  record:  satisfactory  results  were  obtained  by  applying  ICM  to  the 
aggregated  signal  and  then  using  the  resulting  restoration  as  the  starting  point  for  ICM 
on the original pixel grid. A natural extension of this idea is a "cascade" algorithm,
similar  to  that  of  Gidas  (1989),  which  produces  restorations  on  successively  finer  pixel 
grids,  starting  with  a  single  large  pixel  and  ending  with  the  original  grid.  We  have 
found  that  this  approach  provides  a  simple  and  efficient  way  of  adapting  the  ICM 
method  to  very  noisy  data.  It  also  solves  the  refinement  problem,  since  the  end 
product  of  this  algorithm,  or  even  a  restoration  based  on  aggregated  data,  will  provide 
a good starting point for the refinement process.

Our intention in this paper is to follow the ICM approach as much as possible.
There are several places where simulated annealing might be incorporated but it would
require substantially more computing, and there is no guarantee that it would provide
better results. The main advantage of simulated annealing is that it allows one to
escape from a local maximum of the posterior likelihood by a process of trial and
error; however, use of the cascade algorithm to choose a good starting point for the
deterministic ICM algorithm may be just as effective. We do introduce a version of
simulated annealing to implement the refinement method of Section 5. Although this
provides a very convenient way of exploring a larger set of restorations, its impact on
the final restored image for our example is slight.

Some  comment  on  the  role  of  the  prior  model  for  the  true  scene  is  called  for. 
Gidas  (1989)  goes  to  great  lengths  to  ensure  that,  in  his  cascade  algorithm,  the  models 
at  different  pixel  sizes  are  mutually  consistent.  We  are  not  committed  to  a  single 
model and will be happy as long as the final restoration is a good one. It should also
be remembered that all that we require of the end product of one stage of the cascade
algorithm is that it should provide a good starting point for the next. We do not
assume  that  we  have  a  global  MAP  estimate  at  any  stage,  nor  do  we  try  to  make  use 
of  such  a  property. 

We  shall  use  a  single  illustrative  example  throughout  the  paper.  In  the  original 
image the boundaries of objects are smooth in parts but irregular in other places and
certain  features  are  extremely  difficult  to  restore  given  the  level  of  noise  in  the  data. 
Thus,  the  example  shows  both  the  power  of  the  proposed  method  and  its  limitations. 

2.  Model  and  notation 

We first consider a rectangular region partitioned into pixels labelled $1, 2, \ldots, n$.
Each pixel is coloured black or white and the colour of pixel $i$ is denoted by $x_i^*$ which
takes the value 0 for white and 1 for black. The $x_i^*$ are unobserved. It is assumed that
the conditional density function $f(y_i \mid x_i^*)$ is known and for the remainder of this paper
we shall assume that the records $y_i$ are independently distributed as Gaussian with
mean $x_i^*$ and variance $\sigma^2$. The set of records is denoted by $y = \{y_i;\ i = 1, \ldots, n\}$. A
colouring of pixel $i$ (not necessarily the true colouring, $x_i^*$) is denoted by $x_i$, and a
specific colouring of the whole region is denoted by $x = \{x_i;\ i = 1, \ldots, n\}$.

In the MRF model for the true scene we shall use a neighbourhood system in
which pixels are considered to be first order neighbours if they are horizontally or
vertically adjacent to each other and second order neighbours if they are diagonally
adjacent. In our model, the prior distribution for the true scene, $p(x)$, is

$$p(x) \propto \exp[-\{\beta_1 Z_1(x) + \beta_2 Z_2(x)\}], \qquad (2.1)$$

where $Z_1(x)$ is the number of discrepant first order pairs in the scene $x$, i.e. the
number of pairs of first order neighbours which are of opposite colour, $Z_2(x)$ is the
number of discrepant second order pairs and $\beta_1$ and $\beta_2$ are fixed positive constants.

The MAP estimate of the true scene is the value of $x$ which maximises $P(x \mid y)$,
the conditional probability of $x$ given the record $y$. By Bayes' theorem

$$P(x \mid y) \propto l(y \mid x)\, p(x), \qquad (2.2)$$

where $l(y \mid x)$ is the conditional likelihood of the observed record $y$, given the true
colouring, $x$, and $p(x)$ is the prior probability of $x$. Thus, the maximisation of $P(x \mid y)$
corresponds to the minimisation of

$$\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - x_i)^2 + \{\beta_1 Z_1(x) + \beta_2 Z_2(x)\}, \qquad (2.3)$$

over values of $x = \{x_i;\ i = 1, \ldots, n\}$.

Besag's (1986) method of iterated conditional modes updates each pixel in turn,
choosing for it the most likely colour based on its record and the current colouring of
its neighbours, i.e., minimising (2.3) with respect to $x_i$ with all the other pixel
colourings fixed. The expression in (2.3) must decrease or remain constant at each
updating but convergence will usually be to a local minimum. We shall see later in
this paper that the choice of the initial colouring can have a great influence on the
accuracy of the final restoration. Throughout this paper, when ICM is applied, a
second order neighbourhood system will be used with $\beta_2 = \beta_1/\sqrt{2}$; this ratio of $\beta_1$ to $\beta_2$
minimises the rotational variance of the second term of (2.3) with respect to the
positioning of the pixel grid on a given scene (see Brown, Jennison and Silverman,
1987).
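The ICM update described above can be sketched as follows for the binary case. This is an illustrative sketch rather than the authors' implementation: images are stored as lists of rows, pixels at the border simply have fewer neighbours (an assumption, since the treatment of the border is not specified here), and the function names are hypothetical.

```python
import math

FIRST_ORDER = [(-1, 0), (1, 0), (0, -1), (0, 1)]
SECOND_ORDER = [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def closest_mean_classifier(y):
    """Colour a pixel black (1) if its record exceeds 0.5, white (0) otherwise."""
    return [[1 if value > 0.5 else 0 for value in row] for row in y]


def local_cost(x, y, i, j, colour, sigma2, beta1, beta2):
    """Terms of (2.3) that involve pixel (i, j) when it is given `colour`."""
    rows, cols = len(x), len(x[0])
    cost = (y[i][j] - colour) ** 2 / (2.0 * sigma2)
    for offsets, beta in ((FIRST_ORDER, beta1), (SECOND_ORDER, beta2)):
        for di, dj in offsets:
            a, b = i + di, j + dj
            if 0 <= a < rows and 0 <= b < cols and x[a][b] != colour:
                cost += beta
    return cost


def objective(x, y, sigma2, beta1, beta2):
    """Full expression (2.3), counting every neighbouring pair once."""
    rows, cols = len(x), len(x[0])
    total = 0.0
    for i in range(rows):
        for j in range(cols):
            total += (y[i][j] - x[i][j]) ** 2 / (2.0 * sigma2)
            # Look only "forwards" so that each pair is counted once.
            for di, dj, beta in ((0, 1, beta1), (1, 0, beta1),
                                 (1, 1, beta2), (1, -1, beta2)):
                a, b = i + di, j + dj
                if 0 <= a < rows and 0 <= b < cols and x[a][b] != x[i][j]:
                    total += beta
    return total


def icm(y, x0, sigma2, beta1, beta2=None, max_sweeps=50):
    """Sweep the pixels in raster order until no pixel changes colour."""
    if beta2 is None:
        beta2 = beta1 / math.sqrt(2.0)
    x = [row[:] for row in x0]
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(x)):
            for j in range(len(x[0])):
                other = 1 - x[i][j]
                # Flip only on a strict decrease, so (2.3) never increases.
                if (local_cost(x, y, i, j, other, sigma2, beta1, beta2)
                        < local_cost(x, y, i, j, x[i][j], sigma2, beta1, beta2)):
                    x[i][j] = other
                    changed = True
        if not changed:
            break
    return x
```

A typical call is `icm(y, closest_mean_classifier(y), sigma2, beta1)`; passing a different initial colouring in place of the classifier output is all that is needed to try other starting points.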

In  the  above  model  for  the  true  scene  it  is  assumed  that  each  pixel  is  coloured 
wholly  black  or  white.  This  is  at  best  an  approximation:  more  generally,  one  might 
expect  pixels  on  the  boundary  of  an  object  to  contain  areas  of  each  colour,  in  which 
case the record $y_i$ will be distributed as Gaussian with variance $\sigma^2$ and mean equal to
the  proportion  of  pixel  i  coloured  black.  Although  we  shall  consider  problems  in  which 
there  is  a  general  true  scene,  we  start  by  considering  restorations  based  on  a  discrete 
MRF  model  in  which  each  pixel  has  a  single  colour.  The  refinement  method  described 
in  Section  5  does,  however,  allow  boundary  pixels  to  be  coloured  partly  black  and 
partly  white. 


3.  An  example 


Figure  1.  The  true  scene. 


An example of a binary scene containing two separate objects is shown in Figure
1. A 256 by 256 pixel grid was superimposed on this scene and the proportion, $p_i$, of
black in pixel $i$ was calculated for each pixel. The record $y_i$ was obtained by adding
Gaussian noise with variance 4 to this proportion, $p_i$. Figure 2 shows the closest
mean classifier for this record, in which a pixel is coloured black if its record is
greater than 0.5 and white otherwise. One would not normally hope to recover an
image which has been exposed to such a large amount of noise and Figure 3 shows the
rather unsatisfactory restoration obtained by applying ICM with $\beta_1 = 4$. The value $\beta_1 = 4$
is unusually high but we found this to give the best results. (Note that even if $\beta_1 \to \infty$,
certain configurations of pixels remain unsmoothed.)

The major problem in our example is the low signal to noise ratio. This ratio
may be improved by aggregating the record, i.e., by replacing sets of 2 by 2 pixels by
a single large pixel with record equal to the average of the original four. This also
corresponds to viewing the original image on a coarser grid. The variance of the new
record is one quarter that of the original but the range of the $p_i$'s is still [0,1]; thus
there is a substantial increase in the signal to noise ratio. The restoration shown in
Figure 4 was obtained by applying ICM to the aggregated record; the prior model for
the true scene had the same form as (2.1) but was applied to larger pixels; the value
$\beta_1 = 4$ was also used here as it was found to give the best results. The clear superiority
of this restoration to that shown in Figure 3 demonstrates the advantage of working
with the aggregated record. One explanation of the success of this restoration process
is that it allows the ICM algorithm to look further afield when gathering neighbour
information: ICM on the original pixel grid can easily be trapped in a local maximum
of the a posteriori likelihood when only one pixel is allowed to change at a time.
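The aggregation step itself is a block average. A minimal sketch, assuming the record is stored as a list of rows with even numbers of rows and columns (the function name is hypothetical):

```python
def aggregate(y):
    """Replace each 2 by 2 block of records by the average of its four values.

    If the original records have independent noise of variance sigma2,
    each aggregated record has noise variance sigma2 / 4.
    """
    return [[(y[i][j] + y[i][j + 1] + y[i + 1][j] + y[i + 1][j + 1]) / 4.0
             for j in range(0, len(y[0]), 2)]
            for i in range(0, len(y), 2)]
```

Applying `aggregate` repeatedly gives successively coarser records, so a 256 by 256 record becomes 128 by 128, then 64 by 64, and so on.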

Repeating  the  aggregation  process  gives  the  restorations  shown  in  Figures  5  and 
6,  which  are  the  restorations  at  two  and  three  levels  of  aggregation  respectively.  These 
restorations  were  obtained  using  (3j=l,  a  more  typical  value,  which  we  have  found 
gives  good  results  in  cases  where  the  signal  to  noise  ratio  is  moderate.  Note  that  the 
computational  time  and  storage  requirements  for  the  processing  of  a  32  by  32  image 
are approximately 1/64 of those needed to process a 256 by 256 image.

So  far,  we  have  followed  Besag’s  method  and  used  the  closest  mean  classifier  as 
our  initial  colouring  for  the  256  by  256  case  and  this  is  partly  responsible  for  the  poor 
quality  of  the  restoration  in  Figure  3.  A  better  initial  colouring  might  be  the  final 
restoration  obtained  from  an  aggregated  record.  Figure  7  shows  the  result  of  using 
Figure 5 as the initial colouring for ICM on the 256 by 256 grid with $\beta_1 = 4$; a similar
result is obtained with $\beta_1 = 1$. The superiority of this restoration to that of Figure 3
demonstrates  the  influence  of  the  initial  colouring  on  the  resulting  image. 

The  method  of  simulated  annealing  is  less  dependent  on  the  initial  colouring, 
since  it  can  progress  from  one  local  minimum  of  (2.3)  to  another  whilst  passing 
through  higher  intermediate  values.  Thus,  simulated  annealing  is  able  to  search  at  least 
a little further afield than the myopic ICM strategy. An advantage of using an
aggregation  procedure  is  that  it  allows  the  ICM  approach  to  use  more  distant 
neighbour  information  whilst  maintaining  its  computational  speed. 


4.  The  Cascade  Algorithm 

In  the  previous  section  we  introduced  the  idea  of  using  the  restoration  obtained 
from  an  aggregated  record  as  the  initial  colouring  for  restoration  on  a  finer  scale.  We 
now extend this idea to define a "cascade" algorithm in which restorations obtained
from $2^m$ by $2^m$ grids are used as the initial colourings for restorations on $2^{m+1}$ by
$2^{m+1}$ grids. A single pixel restoration is obtained by aggregating the record until it is
one  pixel  in  size:  this  is  then  used  as  the  initial  colouring  for  the  ICM  method  on  the  2 
by  2  grid.  This  restoration  is  in  turn  used  as  the  initial  colouring  for  ICM  on  the  4  by 
4  grid  and  we  continue  in  this  way,  obtaining  restorations  right  up  to  the  level  of  the 
original record. The last six in the series of restorations for our example are shown in
Figures 8-13; the value $\beta_1 = 1$ was used at each level, though it is interesting to note
that  using  higher  values  at  the  128  and  256  levels  made  virtually  no  difference  to  the 
image  obtained. 
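
The cascade just described can be sketched as follows. The sketch rests on several assumptions that are not taken from this section: colours are coded 0 and 1, the record has means 0 and 1 with Gaussian noise of variance `sigma2` on the finest grid, and the local penalty in `local_cost` (squared misfit plus `beta` for each unlike neighbour among the eight nearest) merely stands in for the model (1.1) and criterion (2.3). All function names are hypothetical, and the bookkeeping of recently changed pixels is omitted.

```python
import random


def aggregate(record):
    """Average each 2 by 2 block of a square record (list of rows)."""
    n = len(record)
    return [[(record[r][c] + record[r][c + 1]
              + record[r + 1][c] + record[r + 1][c + 1]) / 4.0
             for c in range(0, n, 2)]
            for r in range(0, n, 2)]


def refine(colouring):
    """Copy each coarse pixel's colour onto the 2 by 2 block it covers."""
    fine = []
    for row in colouring:
        doubled = [colour for colour in row for _ in range(2)]
        fine.append(doubled)
        fine.append(list(doubled))
    return fine


def local_cost(y, colour, neighbours, sigma2, beta):
    """ASSUMED penalty: Gaussian misfit plus beta per unlike neighbour."""
    unlike = sum(1 for v in neighbours if v != colour)
    return (y - colour) ** 2 / (2.0 * sigma2) + beta * unlike


def icm(record, colouring, sigma2, beta, max_sweeps=20):
    """Sweep the grid, giving each pixel its locally cheapest colour."""
    n = len(record)
    x = [row[:] for row in colouring]
    for _ in range(max_sweeps):
        changed = False
        for r in range(n):
            for c in range(n):
                nbrs = [x[rr][cc]
                        for rr in range(max(r - 1, 0), min(r + 2, n))
                        for cc in range(max(c - 1, 0), min(c + 2, n))
                        if (rr, cc) != (r, c)]
                best = min((0, 1), key=lambda col: local_cost(
                    record[r][c], col, nbrs, sigma2, beta))
                if best != x[r][c]:
                    x[r][c] = best
                    changed = True
        if not changed:
            break
    return x


def cascade(record, sigma2, beta=1.0):
    """Return restorations on the 1x1, 2x2, ..., n x n grids (n a power of 2)."""
    levels = [record]
    while len(levels[-1]) > 1:
        levels.append(aggregate(levels[-1]))
    levels.reverse()  # coarsest record first
    # Single pixel restoration: closest mean classification of the average.
    colouring = [[1 if levels[0][0][0] > 0.5 else 0]]
    restorations = [colouring]
    for level in levels[1:]:
        cells = (len(record) // len(level)) ** 2  # original pixels per cell
        colouring = icm(level, refine(colouring), sigma2 / cells, beta)
        restorations.append(colouring)
    return restorations


# Hypothetical 16 by 16 scene: left half black (1), right half white (0).
random.seed(1)
truth = [[1 if c < 8 else 0 for c in range(16)] for _ in range(16)]
noisy = [[v + random.gauss(0.0, 0.5) for v in row] for row in truth]
for restoration in cascade(noisy, sigma2=0.25):
    print(len(restoration), "by", len(restoration))
```

The same value of `beta` is passed at every level, in line with the choice reported above; a level-dependent sequence could be substituted by changing the call inside `cascade`.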

The  method  of  Gidas  (1989)  is  very  similar  to  the  procedure  we  have  just 
described.  However,  Gidas  uses  a  single  MRF  model  defined  on  the  finest  pixel  grid 
and  employs  the  "renormalization  group"  approach  to  compute  the  models  implied  for 
coarser grids. Both the complexity of the models at the aggregated levels and the use
of simulated annealing at each stage make this a computationally demanding method.
We  have  tried  to  keep  computation  to  a  minimum  at  the  expense  of  a  less  rigorous 
treatment  of  the  prior  model:  given  the  approximate  nature  of  this  model,  we  would 
argue  that  this  is  not  unreasonable. 

One  might  at  least  try  to  develop  theoretical  arguments  to  produce  a  "correct" 
sequence of values of $\beta_1$ for use at different stages of the cascade algorithm. Brown,
Jennison  and  Silverman  (1987)  interpret  the  second  term  of  (2.3)  as  a  penalty  and 
suggest  that  it  should  be  chosen  to  be  approximately  independent  of  the  pixel  grid 
superimposed.  They  suggest  that  this  penalty  should  approximate  a  constant  multiple 
of  the  total  boundary  length  in  the  image.  In  our  application  this  would  imply  that  the 
parameter $\beta_1$ be halved as the pixel sizes are quartered but we have not found this to
be very successful in practice. Using the same value of $\beta_1$ at each stage produced
substantially  better  results. 

When  processing  the  larger  images  we  avoid  unnecessary  computations  by  storing 
the  coordinates  of  pixels  whose  colourings  have  changed  in  the  current  iteration.  If  the 
number  of  these  is  small,  only  pixels  whose  neighbours  have  changed  colour  in  the 
last  iteration  are  considered  for  updating  in  the  next  iteration.  For  each  of  the  images 
shown  in  Figures  8-13  one  complete  iteration  plus  some  minor  changes  was  all  that 
was  required.  Summing  a  geometric  series,  we  see  that  the  total  computation  required 
is approximately equivalent to $1\tfrac{1}{3}$ iterations of ICM on the finest pixel grid.
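
The geometric series mentioned above can be checked directly. The sketch assumes that exactly one complete sweep is made at each level and that a sweep costs time proportional to the number of pixels, so each coarser grid costs one quarter of the next finer one; the function name `relative_cost` is hypothetical.

```python
from fractions import Fraction


def relative_cost(levels):
    """Total cost of one sweep on each of `levels` grids, measured in
    sweeps of the finest grid (each coarser grid has 1/4 of the pixels)."""
    return sum((Fraction(1, 4) ** k for k in range(levels)), Fraction(0))


# Nine grids: 256 by 256 down to the single pixel.
total = relative_cost(9)
print(total, float(total))
```

The partial sums increase towards the limit 1/(1 - 1/4) = 4/3 without reaching it, so the cascade costs slightly less than one and a third sweeps of the finest grid under these assumptions.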

We  have  seen  that  the  restorations  obtained  on  the  finer  grids  have  been 
insensitive to the choice of $\beta_1$. This is partly attributable to the high noise level
(updating is essentially by the "majority vote rule" at quite low values of $\beta_1$) but also
suggests that, for a given image, restoration at too fine a pixel level is unnecessary,
adding only computation and superfluous detail to what is already a satisfactory
restoration.  We  are  able  to  make  a  direct  comparison  of  restorations  obtained  at 
different  levels  of  aggregation  by  superimposing  the  finer  grid  on  the  coarser  image 
and  calculating  penalties  for  both,  based  on  the  finer  record  and  the  MRF  model  at  that 
level.  The  coarser  image  is  disadvantaged,  since  it  was  chosen  when  searching  for  the 
minimum  of  a  different  penalty.  We  measure  the  benefit  of  restoring  at  the  finer  level 
by  the  percentage  decrease  in  the  penalty.  The  values  are  tabulated  below. 


Grid size of          Grid size of         Percentage reduction
coarse restoration    fine restoration     in penalty

2 x 2                 4 x 4                68.1
4 x 4                 8 x 8                75.8
8 x 8                 16 x 16              49.6
16 x 16               32 x 32              21.5
32 x 32               64 x 64               5.2
64 x 64               128 x 128             1.6
128 x 128             256 x 256             0.6

Analysis  of  these  values  is  purely  subjective  but  appears  to  suggest  that  the  64  by  64 
level is satisfactory. Inspection of Figures 8-13 leads to the same conclusion.


5.  Subpixel  refinement

So  far  the  restoration  techniques  we  have  used  have  coloured  each  pixel  wholly 
one  colour,  even  though  pixels  on  the  edges  of  objects  in  the  true  scene  may  be  partly 
black  and  partly  white.  We  now  consider  techniques  which  allow  both  colours  to 
appear  in  a  single  pixel.  Jennison  (1986)  used  a  modification  of  the  ICM  method  to 
obtain  a  restoration  in  which  each  pixel  was  divided  into  4  subpixel  quarters  and  a 
separate  colour  allocated  to  each  subpixel.  His  method  used  the  ICM  restoration  at  full 
pixel  size  as  a  starting  point  for  restoration  at  the  subpixel  level.  The  success  of  this 
technique  prompted  Jennison  and  Jubb  (1987)  to  consider  the  further  refinement  of 
pixels. 

Since  the  number  of  different  colourings  of  a  pixel  grows  exponentially  with  the 
number  of  subpixels,  the  extension  of  Jennison’s  method  to  a  finer  subdivision  of  each 
pixel  is  computationally  prohibitive.  However,  the  limit  of  this  process,  in  which  an 
arbitrary  colouring  of  each  pixel  is  allowed,  can  be  made  tractable.  Rather  than 
specify a MRF model for the true scene we interpret the minimisation of (2.3) as a
form  of  penalised  maximum  likelihood.  The  second  term  of  (2.3)  is,  approximately,  a 
multiple  of  the  total  boundary  length  in  the  image,  x.  Thus,  an  analogous  penalty  for  a 
general  restoration,  x,  is 


$$\frac{1}{2\sigma^2} \sum_{i=1}^{n} \bigl(y_i - p_i(x)\bigr)^2 + \beta L(x), \qquad (5.1)$$

where $p_i(x)$ denotes the proportion of black in pixel $i$, $L(x)$ is the total edge length in
scene $x$ and $\beta$ is a fixed constant. For computational simplicity we restrict attention to
restorations  in  which  pixels  are  either  of  a  single  colour  or  are  separated  into  areas  of 
different  colour  by  a  single  straight  line  with  the  line  segments  defining  such  areas  in 
adjacent  pixels  meeting  at  a  point. 

A  black  and  white  image  can  be  regarded  as  a  series  of  line  segments  separating 
the two colours. Jennison and Jubb (1987) use the restoration obtained from Jennison’s
quarter pixel method as an initial representation for the line segments. The
updating  process  treats  pixels  in  pairs,  selecting  the  best  place  for  two  edges  to  meet, 
given  the  current  restoration  of  neighbouring  pixels.  We  repeat  the  details  for 
completeness. 


Figure  14.  Updating  the  position  of  edges  in  pixels  i  and  j. 

As  an  example,  consider  the  configuration  at  pixels  i  and  j  shown  in  Figure  14. 
The  distances  a  and  b  are  determined  by  the  current  colouring  of  neighbouring  pixels 
and treated as constant for the moment. The distance W is chosen to minimise the
contribution  from  pixels  i  and  j  to  the  total  penalty  (5.1),  i.e. 

$$g(W) = \frac{1}{2\sigma^2} \sum_{k=i,j} (y_k - p_{kW})^2 + \beta\,(e_{iW} + e_{jW}), \qquad (5.2)$$

where $e_{kW}$ is the length of edge in pixel $k$ when the join is at $W$ and $p_{kW}$ is the
proportion of black in pixel $k$ when the join is at $W$.

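For the case shown in Figure 14, this penalty is

$$g_1(W) = \frac{1}{2\sigma^2}\Bigl\{\bigl(y_i - a - \tfrac{1}{2}(W-a)\bigr)^2 + \bigl(y_j - b - \tfrac{1}{2}(W-b)\bigr)^2\Bigr\} + \beta\Bigl\{\sqrt{1+(W-a)^2} + \sqrt{1+(W-b)^2}\Bigr\}.$$

This cannot be minimised directly but the form of

$$\frac{dg_1(W)}{dW} = \frac{1}{4\sigma^2}(2W + a - 2y_i + b - 2y_j) + \beta\left[\frac{W-a}{\sqrt{1+(W-a)^2}} + \frac{W-b}{\sqrt{1+(W-b)^2}}\right]$$

suggests an iterative approach. Given an approximate solution $W_{s-1}$ we solve

$$\frac{1}{4\sigma^2}(2W_s + a - 2y_i + b - 2y_j) + \beta\left[\frac{W_s-a}{\sqrt{1+(W_{s-1}-a)^2}} + \frac{W_s-b}{\sqrt{1+(W_{s-1}-b)^2}}\right] = 0$$

to obtain

$$W_s = \frac{4\sigma^2\beta\left[\dfrac{a}{\sqrt{1+(W_{s-1}-a)^2}} + \dfrac{b}{\sqrt{1+(W_{s-1}-b)^2}}\right] + (2y_i - a + 2y_j - b)}{2 + 4\sigma^2\beta\left[\dfrac{1}{\sqrt{1+(W_{s-1}-a)^2}} + \dfrac{1}{\sqrt{1+(W_{s-1}-b)^2}}\right]}.$$

Starting from any sensible initial value, $W_0$, accuracy to 3 decimal places was
achieved after at most four iterations. In practice we take $W_0$ to be the value of $W$
prior to this update.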
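
The iteration for case (i) can be sketched as below. The function names `update_join` and `penalty`, the stopping tolerance and the numerical values in the usage lines are assumptions chosen for illustration. The sketch also places no constraint on W (for example, keeping the join within the pixel edge), since the treatment of that point is not stated here.

```python
import math


def penalty(w, y_i, y_j, a, b, sigma2, beta):
    """Contribution of pixels i and j to the penalty when the join is at w."""
    misfit = ((y_i - a - 0.5 * (w - a)) ** 2
              + (y_j - b - 0.5 * (w - b)) ** 2)
    length = math.sqrt(1.0 + (w - a) ** 2) + math.sqrt(1.0 + (w - b) ** 2)
    return misfit / (2.0 * sigma2) + beta * length


def update_join(y_i, y_j, a, b, sigma2, beta, w0, tol=1e-3, max_iter=50):
    """Iterate the update for W, freezing the square roots at the
    previous iterate, until successive values differ by less than tol."""
    w = w0
    k = 4.0 * sigma2 * beta
    for _ in range(max_iter):
        inv_a = 1.0 / math.sqrt(1.0 + (w - a) ** 2)
        inv_b = 1.0 / math.sqrt(1.0 + (w - b) ** 2)
        w_new = ((k * (a * inv_a + b * inv_b) + 2.0 * y_i - a + 2.0 * y_j - b)
                 / (2.0 + k * (inv_a + inv_b)))
        if abs(w_new - w) < tol:
            return w_new
        w = w_new
    return w


# Hypothetical values for the record and the neighbouring edge positions.
w_hat = update_join(y_i=0.6, y_j=0.4, a=0.5, b=0.3,
                    sigma2=0.1, beta=1.0, w0=0.5)
print(round(w_hat, 3), penalty(w_hat, 0.6, 0.4, 0.5, 0.3, 0.1, 1.0))
```

Because each term of the penalty is convex in W, a stationary point found this way is the minimiser for the given a and b.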

Different  forms  of  (5.2)  are  possible  depending  on  which  neighbours  of  pixels  i 
and  j  contain  both  colours.  There  are  only  four  distinct  cases  that  may  arise  and  these 
are  shown  in  Figure  15. 

We  have  shown  the  method  of  solution  for  case  (i);  cases  (ii)  -  (iv)  are  solved  in 
a  similar  way.  All  other  cases  can  be  reduced  to  one  of  the  above  by  means  of 
exchanging  and/or  inverting  the  pixels  and  their  colours.  The  edge  pixels  are  updated 
in  turn,  following  an  edge  around,  completing  circuits  of  the  edge  until  convergence. 

Fig. 15. Possible configurations of edges in two neighbouring pixels.


The  complete  restoration  algorithm 

We  can  now  combine  both  aggregation  and  refinement  into  a  three  stage 
algorithm: 

Stage 1:  Apply the cascade algorithm using ICM on the aggregated
          records up to a suitable point. The record is now fixed
          at this level and no further use will be made of the
          original record. (If the record is still aggregated at
          this level, substantial savings in computation will result.)

Stage 2:  Iterate Jennison’s quarter pixel refinement to convergence.
          This is very quick and supplies a good starting point for
          the line fitting process.

Stage 3:  Apply the line fitting algorithm to convergence.

A  development  in  the  line  fitting  algorithm 

In the line fitting algorithm described by Jennison and Jubb (1987) the route that
the lines take through pixel edges is determined once and for all by the restoration
used as the starting point for the line fitting.
We  have  now  extended  the  algorithm  to  allow  changes  in  this  route.  Each  time  the 
point  at  which  the  edge  crosses  a  pixel  boundary  is  updated  an  alternative  route  is 
compared.  A  number  of  cases  have  to  be  treated  separately;  three  qualitatively 
different  configurations  are  shown  in  Figure  16. 

Fig. 16. Examples of configurations at which alternative routes are considered.

The  contribution  to  the  total  penalty  from  all  four  pixels  is  calculated  for  each  of  the 
two  routes  with  line  edges  chosen  optimally  for  that  route.  In  the  basic  method,  the 
route which has the smallest penalty is then chosen.


Figure  17.  Figure  18. 


Figures  17  and  18  show  the  restorations  obtained  from  applying  the  line  fitting 
method to the aggregated record in the example. In Figure 17 the grid size is 32 by
32  and  in  Figure  18  it  is  64  by  64.  In  the  previous  section  we  suggested  that  a  grid 
size  of  64  by  64  would  be  sufficient  and  the  restoration  shown  in  Figure  18  is  indeed 
satisfactory. In both cases we used $\beta_1 = 1$ at the ICM and quarter pixel levels of
restoration and $\beta = 4$ for the line fitting.

The  updating  process  in  the  above  line  fitting  procedure  has  the  general 
characteristics  of  an  ICM  method:  the  penalty  (5.1)  is  minimised  with  respect  to  one 
component  of  the  boundary  whilst  everything  else  is  held  fixed.  This  method  will 
generally  yield  a  local  minimum  of  (5.1)  and  it  is  possible  that  the  final  restoration 
could  be  improved  further  by  making  a  number  of  route  changes  simultaneously.  For 
example,  the  penalty  (5.1)  might  be  reduced  by  moving  a  long  vertical  edge  one  pixel 
to  the  left  whereas  it  would  increase  initially  if  only  one  route  change  were  made  at  a 
time. 

To  allow  further  exploration  of  alternative  routes  we  have  implemented  a  form  of 
simulated  annealing.  This  method  retains  the  property  that  for  a  given  route  the  point 
on  a  pixel  edge  at  which  two  line  segments  meet  is  chosen  optimally.  However,  when 
comparing the minimum penalties for different routes we allow the route with the
larger penalty to be chosen with non-zero probability. Suppose two routes, A and B,
have minimum penalties $\mathrm{pen}_A$ and $\mathrm{pen}_B$; then, when the annealing process is at
temperature $T$, we select route A and its optimal edges with probability

$$\frac{e^{-\mathrm{pen}_A/T}}{e^{-\mathrm{pen}_A/T} + e^{-\mathrm{pen}_B/T}};$$

otherwise we choose route B. Of course, only the contribution to the total penalty
from  the  four  pixels  concerned  need  actually  be  calculated. 
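
The random choice between the two routes can be sketched as follows. The function names are hypothetical. The probability is evaluated in the algebraically equivalent form 1/(1 + exp((pen_A - pen_B)/T)) to avoid overflow at low temperatures, and the tie-breaking rule at T = 0 (prefer route A) is an assumption.

```python
import math
import random


def prob_route_a(pen_a, pen_b, temperature):
    """Probability of selecting route A at a strictly positive temperature."""
    d = (pen_a - pen_b) / temperature
    if d >= 0.0:
        e = math.exp(-d)          # at most 1, so no overflow
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(d))


def choose_route(pen_a, pen_b, temperature, rng=random):
    """Return 'A' or 'B'; at temperature zero the smaller penalty wins."""
    if temperature <= 0.0:
        return "A" if pen_a <= pen_b else "B"   # assumed tie-break
    return "A" if rng.random() < prob_route_a(pen_a, pen_b, temperature) else "B"


# Only the penalties of the four pixels concerned are needed as inputs.
print(prob_route_a(pen_a=1.0, pen_b=2.0, temperature=1.0))
print(choose_route(pen_a=1.0, pen_b=2.0, temperature=0.0))
```

As the temperature falls the probability concentrates on the route with the smaller penalty, which recovers the deterministic rule of the basic method.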

By  restricting  the  random  choice  to  the  route  alone,  we  ensure  that,  effectively, 
the  annealing  process  is  applied  to  a  fairly  low  dimension  problem,  the  number  of 
variables  being  of  the  order  of  the  number  of  boundary  pixels.  Theorem  B  of  Geman 
and  Geman  (1984)  demonstrates  the  convergence  of  their  simulated  annealing  method. 
In  its  stated  form,  this  theorem  does  not  apply  to  our  hybrid  procedure  whose  iterative 
steps  combine  a  random  choice  of  route  with  a  deterministic  choice  of  edges  given  that 
route  and  currently  fixed  end  points.  Perhaps  a  sufficiently  general  result  could  be 
proved  but  this  would,  presumably,  still  only  apply  for  gentle  cooling  schedules. 
However,  we  prefer  to  think  of  the  annealing  method  simply  as  a  convenient  numerical 
procedure  which  searches  a  little  further  afield  than  the  ICM  approach. 

We  have  experimented  with  a  variety  of  cooling  schedules  for  our  example  using 
the  aggregated  record  at  both  the  32  by  32  and  64  by  64  grid  levels.  The  best  results 
were obtained using a cooling schedule in which T decreased logarithmically from 3.5
to  0.5  over  several  hundred  sweeps  and  linearly  from  0.5  to  zero  over  several  hundred 
more. We then continued to update using $T = 0$ until convergence, which usually
required  only  a  few  sweeps.  Although  simulated  annealing  often  produced  a  lower 
penalty,  the  restoration  produced  was  never  visually  superior  to  that  obtained  using  the 
local  maximisation  procedure. 
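
One possible reading of the two-phase cooling schedule is sketched below. The text does not fix the exact form of the logarithmic phase or the number of sweeps, so both are assumptions here: the first phase is taken to be linear in log T (other readings, such as T proportional to 1/log(1 + k), are equally possible), and 300 sweeps per phase is an arbitrary stand-in for "several hundred". The function name is hypothetical.

```python
import math


def cooling_schedule(t_start=3.5, t_mid=0.5, n_log=300, n_linear=300):
    """Temperatures for successive sweeps: a phase that is linear in log T
    from t_start to t_mid, then a linear phase from t_mid down to zero."""
    temps = []
    for k in range(n_log):
        frac = k / n_log
        log_t = (1.0 - frac) * math.log(t_start) + frac * math.log(t_mid)
        temps.append(math.exp(log_t))
    for k in range(n_linear):
        temps.append(t_mid * (1.0 - k / n_linear))
    temps.append(0.0)  # further sweeps at T = 0 continue until convergence
    return temps


schedule = cooling_schedule()
print(len(schedule), schedule[0], schedule[300], schedule[-1])
```

Each temperature in the list would be used for one sweep of route updates, after which updating continues at temperature zero.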

Our  conclusion  is  that  the  starting  point  provided  by  the  cascade  algorithm  was 
sufficiently  good  that  the  deterministic  line  fitting  algorithm  was  very  nearly  optimal. 

6.  Concluding  Remarks. 

Combining  the  line  fitting  procedure  with  the  cascade  algorithm  has  produced  a 
fast  and  effective  method  for  obtaining  a  high  quality  restoration  from  noisy  data. 
Further work is required to provide an automatic choice of suitable values of $\beta_1$ at
different  grid  levels  and  a  criterion  for  terminating  the  cascade  algorithm  at  the  most 
appropriate  level  of  aggregation.  Although  we  have  considered  only  two-colour 
images  in  this  paper,  it  is  clear  that  the  basic  ideas  are  more  generally  applicable:  we 
hope  to  continue  work  on  the  development  of  an  aggregation  and  refinement  algorithm 
for  grey  level  images. 